Space efficient polymer sets

ABSTRACT

The disclosure features a collection that comprises a plurality of polymers, typically nucleic acid molecules in a compact form. The molecules include all possible sequences or at least a certain percentage of all possible sequences, of a particular length.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to provisional applications Ser. No. 60/564,864, filed on Apr. 23, 2004, and Ser. No. 60/587,066, filed Jul. 12, 2004, the contents of both of which are incorporated by reference.

BACKGROUND

Nucleic acid arrays have a variety of uses. One application is evaluating the binding specificity of nucleic acid binding proteins.

SUMMARY

In one aspect, the disclosure features a collection that comprises a plurality of nucleic acid molecules. The molecules include all possible sequences or at least a certain percentage (e.g., at least 10, 25, 50, 60, 70, 80, 90, 95, 98, 99, 99.9%) of all possible sequences, of a particular length, k. These sequences are termed “k-mers.” k can be, for example, greater than or equal to 5, 6, 7, 8, 9, 10, or 12 or, e.g., between 6-20, 6-18, 8-15, or 8-12. Each molecule of the plurality has length greater than k, e.g., a length of k+n nucleotides. For example, n is at least 1, 2, 3, 5, or 10. Typically, the nucleic acid molecules of the plurality are substantially longer than k, and can be at least 1.5×k nucleotides long or at least an integral multiple of k, e.g., at least 2×k nucleotides long; further examples of multiples include at least 2.2, 2.5, 2.7, 3.0, 3.2, 3.5, 4, 4.5, 5, 6, 7, and ranges in between. In one embodiment, at least some molecules of the plurality (and typically all molecules of the plurality) include at least two, three, four, or five different k-mers.

In one embodiment, the number of unique nucleic acid molecules of the plurality is fewer than the number of all possible k-mer sequences or, if the collection includes only a certain percentage of such sites, fewer than the number of such sequences. The compaction ratio (defined as the number of unique molecules divided by the number of represented k-mers) can be, for example, less than 0.5, 0.2, 0.1, 0.06, 0.05, 0.01, or 0.002. Thus, in an embodiment in which all possible sites of length k are represented, the collection includes fewer molecules than the number of all such possible sequences. This compact design is achieved by including more than one k-mer per molecule, while maintaining similar k-mers on different molecules of the collection.
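The compaction ratio defined above can be computed directly from a list of molecule sequences. The following is a minimal sketch, assuming single-stranded counting (reverse complements are not merged) and using a hypothetical helper name, `compaction_ratio`, and toy sequences that do not come from the disclosure.

```python
def compaction_ratio(molecules, k):
    """Number of unique molecules divided by number of distinct k-mers they represent.

    Illustrative helper (name and signature are assumptions). Only the
    forward strand of each molecule is scanned.
    """
    unique_molecules = set(molecules)
    represented = set()
    for seq in unique_molecules:
        # Slide a window of width k along the molecule.
        for i in range(len(seq) - k + 1):
            represented.add(seq[i:i + k])
    if not represented:
        raise ValueError("no k-mers represented; is every molecule shorter than k?")
    return len(unique_molecules) / len(represented)


# Toy example: two 6-base molecules, each carrying three distinct 4-mers.
toy_collection = ["ACGTAC", "TTGCAA"]
ratio = compaction_ratio(toy_collection, k=4)
print(f"compaction ratio = {ratio:.3f}")
```

With one k-mer per molecule the ratio is 1; packing several k-mers into each molecule drives it below 1, which is the sense in which the collection is compact.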

The nucleic acid molecules of the collection can be in solution, can be labeled, can be present in separate containers, in pools, or can be immobilized, e.g., on one or more beads or on an array. The nucleic acid molecules can each also include an invariant region, e.g., a primer binding site and/or a spacer sequence.

In one embodiment, the nucleic acid molecules of the plurality are all less than 150 nucleotides in length, e.g., less than 120, 100, 80, 75, 70, 65 or 60. For example, the nucleic acid molecules of the plurality can be between 30-75 or 50-70 nucleotides in length. In one embodiment, at least 1, 2, 5 or 10% of the molecules of the plurality include artificial sequences or sequences not present in a yeast intergenic region.

In another aspect, the disclosure features a collection that comprises a plurality of non-homogeneous polymer molecules. The molecules include all possible sequences or at least a certain percentage (e.g., at least 10, 25, 50, 60, 70, 80, 90, 95, 98, 99, 99.9%) of all possible sequences, of a particular length, k. These sequences are termed “k-mers.” k can be, for example, greater than or equal to 5, 6, 8, 10, or 12 or, e.g., between 6-20, 6-18, 8-15, or 8-12. Each molecule of the plurality may have the same length, or the lengths may vary. For example, most of the molecules of the collection can have a length of at least k+n subunits. n can be at least 1, 2, 3, 5, or 10. Typically, most of the molecules of the collection are substantially longer than k, e.g., at least 2×k subunits long. In one embodiment, at least some molecules of the plurality (and typically all molecules of the plurality) include at least two, three, four, or five different k-mers.

In one embodiment, the polymer includes a nucleic acid backbone, a peptide backbone, or a sugar backbone. For example, the polymer can be a polypeptide (e.g., a peptide of between 3-30 amino acids) or a larger polypeptide (e.g., greater than 30 amino acids). In embodiments in which the polymer is a nucleic acid, the nucleic acid can be RNA, DNA, or a combination thereof. It can be double-stranded, partially double-stranded, or single-stranded. For example, partially double-stranded and single-stranded molecules may include tertiary structures, e.g., hairpins, bulges and so forth. In a preferred embodiment, at least the variant region of molecules, e.g., the region that includes the different k-mers, in the collection is double-stranded.

In one embodiment, the number of unique nucleic acid molecules of the plurality is fewer than the number of all possible sites or, if the collection includes only a certain percentage of such k-mer sequences, fewer than the number of such sequences. The compaction ratio (defined as the number of unique molecules divided by the number of represented k-mers) can be, for example, less than 0.5, 0.2, 0.1, 0.05, 0.01, or 0.002. Thus, in an embodiment in which all possible sites of length k are represented, the collection includes fewer molecules than the number of all such possible sequences. This compact design can be achieved by including more than one different k-mer per molecule, and by locating similar k-mers (e.g., k-mers that differ by only one or two nucleotides) in different molecules of the collection.

The molecules of the collection can be in solution, can be labeled, can be present in separate containers, in pools, or can be immobilized, e.g., on one or more beads or on an array, in cells, or attached to or contained in viruses. The collection can be used to evaluate an interaction between a molecule of interest and members of the collection.

In another aspect, the disclosure features a nucleic acid array that includes all possible sites or at least a certain percentage (e.g., at least 10, 25, 50, 60, 70, 80, 90, 95, 98, 99, 99.9%) of all possible sites, of a particular length, k. These sites are termed “k-mers.” k can be, for example, greater than or equal to 5, 6, 8, 10, or 12 or, e.g., between 6-20, 6-18, 8-15, or 8-12. The array includes a plurality of addresses. Most addresses of the plurality include at least one nucleic acid molecule of length greater than k, and the molecules at each address of the plurality may have the same length, or the lengths may vary. In certain embodiments, most of the nucleic acid molecules of the collection are at least k+n nucleotides long. For example, n is at least 1, 2, 3, 5, or 10. Typically, most of the nucleic acid molecules of the collection are substantially longer than k, and in preferred embodiments most of the nucleic acid molecules of the collection are at least 2×k nucleotides long.

In one embodiment, the number of unique nucleic acid molecules physically associated with the array is fewer than the number of all possible sites or, if the array includes only a certain percentage of such sites, fewer than the number of such sites. The compaction ratio (defined as the number of unique molecules divided by the number of represented k-mers) can be, for example, less than 0.5, 0.2, 0.1, 0.05, 0.01, or 0.002. Thus, in a typical embodiment, even though all possible sites of length k are represented, the array can have fewer unique addresses than the number of all such possible sites. This compact design is achieved by including more than one k-mer per address, while maintaining similar k-mers at different addresses.

One method for producing a collection of nucleic acids described herein includes representing the collection as a string (or a small number of strings). The Hamming distance function or any other distance function can be used to define sequence similarity among k-mers. Related sequences that are seen to be located in a common region of theoretical sequence space (e.g., a “Hamming ball”) are arranged in the string, and, hence, in the array, so that they are discernable (e.g., distributed to different addresses) from each other and recoverable. This design enables fine discrimination for evaluating specificity. Other exemplary distance functions include: a Euclidean distance, z-score distance, Bhattacharyya distance, Mahalanobis distance, Matusita distance, divergence metric, Chernoff distance, angular metric, Earth Mover's distance, Hausdorff distance, City Block (Manhattan) distance, Chebychev distance, Minkowski distance, or Canberra distance.
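To make the Hamming distance and the “Hamming ball” concrete, the sketch below computes the distance between two equal-length k-mers and enumerates every k-mer within a chosen number of mismatches of a center k-mer. The function names and the DNA alphabet default are illustrative assumptions, not part of the disclosure.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_ball(center, radius, alphabet="ACGT"):
    """All words within `radius` mismatches of `center` (the center included)."""
    ball = set()
    for r in range(radius + 1):
        # Choose which r positions are substituted ...
        for positions in combinations(range(len(center)), r):
            # ... and every way of replacing each with a different letter.
            choices = [[c for c in alphabet if c != center[p]] for p in positions]
            for substitutions in product(*choices):
                word = list(center)
                for p, s in zip(positions, substitutions):
                    word[p] = s
                ball.add("".join(word))
    return ball


ball = hamming_ball("ACGT", radius=1)
print(len(ball))                      # 1 center + 4 positions x 3 substitutions
print(hamming("ACGT", "ACGA"))
```

For a k-mer over a four-letter alphabet, a radius-1 ball contains 1 + 3k words, so the number of similar sites that must be kept apart grows quickly with both k and the radius.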

One method for using a collection of nucleic acid molecules described herein is to evaluate a molecule of interest (e.g., a protein, a drug, a nucleic acid aptamer), a sample, or a functional event (e.g., as a bio-sensor). In one embodiment, a collection of nucleic acid molecules in an array is used to evaluate the specificity of a nucleic acid binding compound, e.g., a nucleic acid binding protein, e.g., a DNA binding protein, such as a transcription factor or an enzyme (e.g., a methylase or restriction endonuclease, etc). In some cases, a protein or nucleic acid can be modified. For example, protein modifications include phosphorylation, ubiquitination, acetylation, and methylation. These modifications are typically site specific. The protein or nucleic acid can also be present in a complex, e.g., with other macromolecules. For example, a protein complex that includes at least two or three polypeptide chains can be used. All the polypeptide chains can differ or, e.g., as in a complex that includes a homodimer, some chains can be the same.

Interactions between a compound of interest and addresses of the array are evaluated, and the interaction data is processed to identify one or more sites with which the compound interacts. For example, if each possible k-mer is present at a plurality of addresses, it is possible to deconvolve the interaction data to identify one or more k-mers that interact with the compound.
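One simple way to deconvolve such data, offered here only as an illustrative sketch and not as the method of the disclosure, is to score each k-mer by the median signal of all addresses whose sequence contains it. The function name, the input layout (pairs of address sequence and measured signal), and the toy numbers are assumptions.

```python
from collections import defaultdict
from statistics import median


def kmer_scores(address_signals, k):
    """Median signal per k-mer over every address that contains the k-mer.

    `address_signals` is an iterable of (sequence, signal) pairs, one per
    address. This is a hypothetical scoring scheme for illustration.
    """
    per_kmer = defaultdict(list)
    for seq, signal in address_signals:
        # Use a set so a k-mer repeated within one address counts once.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            per_kmer[kmer].append(signal)
    return {kmer: median(values) for kmer, values in per_kmer.items()}


# Toy data: signals are made-up numbers, not measurements.
data = [("ACGTA", 10.0), ("CGTAC", 9.0), ("TTTTA", 1.0), ("ACGTT", 8.0)]
scores = kmer_scores(data, k=4)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Because each k-mer appears at several addresses with different neighbors, a k-mer that scores highly across all of its addresses is a stronger candidate than one that is bright at a single address only.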

In one embodiment, the collection of nucleic acid molecules is used to identify a nucleic acid of interest, e.g., an aptamer with binding or enzymatic activity, e.g., with respect to a target protein, e.g., an enzyme or cell-surface protein. For example, the protein can be associated with a disease or disorder. A nucleic acid molecule that binds to and/or inhibits the target protein can be used, e.g., to detect or modulate the disease or disorder.

In another embodiment, the collection of nucleic acid molecules is used to characterize a sample (e.g., a complex sample, e.g., a sample obtained from a subject, e.g., a patient). The pattern of interaction can be used to identify or characterize the sample.

Accordingly, in one embodiment, the collection of nucleic acid molecules can be used to characterize the binding preferences of nucleic acid binding molecules. Since all possible DNA sequence variants can be represented on DNA microarrays in a space- and cost-efficient manner, a reduced number of individual DNA sequences and individual DNA spots can be synthesized. This reduction also can have implications on the density and/or number of addresses on synthesized microarrays.

Because the nucleic acid molecules are longer (e.g., significantly longer) than k, the addition of each base to the length of an oligonucleotide (“oligo”) adds another k-mer to that oligonucleotide. For example, considering 10-mers (e.g., DNA sites 10 nucleotides long), a fifty base long oligo would contain 41 distinct 10-mers (50 − 10 + 1), and a sixty base oligo would contain 51 distinct 10-mers. Compared to the use of a single k-mer per oligo, this approach can greatly reduce the number of molecules required to characterize the binding, enzymatic, or other physical or functional interactions.
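The arithmetic in this paragraph follows from sliding a window of width k along an oligo of length L, which gives L − k + 1 windows. The sketch below reproduces the 41 and 51 figures and, under the simplifying assumptions that every window holds a different k-mer and that only one strand is counted, derives a lower bound on how many oligos are needed to cover all 4^k k-mers. The function names are illustrative.

```python
def kmer_windows(seq, k):
    """All length-k windows of `seq`, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def windows_per_oligo(length, k):
    """Number of k-mer positions in an oligo of the given length."""
    return max(length - k + 1, 0)


def min_oligos_for_all_kmers(k, length, alphabet_size=4):
    """Lower bound: all k-mers divided by windows per oligo, rounded up.

    Assumes no k-mer is repeated anywhere and ignores reverse complements.
    """
    total = alphabet_size ** k
    per_oligo = windows_per_oligo(length, k)
    return -(-total // per_oligo)  # ceiling division


print(kmer_windows("ACGTAC", 4))
print(windows_per_oligo(50, 10), windows_per_oligo(60, 10))
print(min_oligos_for_all_kmers(10, 60))
```

Under these assumptions, sixty-base oligos could in principle cover all 1,048,576 possible 10-mers with roughly twenty thousand molecules, although redundancy requirements such as representing each k-mer more than once raise that number.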

In one embodiment, the nucleic acid includes k-mers, where k is greater than 5, 6, 7, or 8, e.g., a biologically relevant size, and every k-mer, or at least the certain percentage of such k-mers, is represented in at least two different nucleic acid molecules.

In one embodiment, in which the nucleic acids of the collection are immobilized on a surface, those two different nucleic acid molecules for a particular k-mer would be located at distinguishable addresses. Each k-mer in the collection can be represented at least once, twice, or more (e.g., 3, 4, or 5) times with a variety of flanking bases (e.g., in different nucleic acid molecules). The flanking bases could be either a particular sequence (e.g., non-degenerate) or be degenerate (e.g., N which is a mixture of A, C, G, and T, or R which is a mixture of G and A). The form of this collection of k-mers could be variable. For example, in one embodiment, the collection is provided as an array, e.g., an array of double-stranded oligonucleotides. The oligonucleotides can be, for example, in the range of 10 to 100 bases.

In one embodiment, each k-mer of the collection can be flanked by each of the four nucleotides on both sides, resulting in sixteen variants. For example, considering ACGT, it would be most informative to have AACGTA, AACGTC, AACGTG, AACGTT, CACGTA, CACGTC, CACGTG, CACGTT, GACGTA, GACGTC, GACGTG, GACGTT, TACGTA, TACGTC, TACGTG, and TACGTT represented on a DNA microarray.
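The sixteen flanked variants listed for ACGT can be generated mechanically by pairing every 5′ flanking base with every 3′ flanking base. The sketch below is illustrative; the function name is an assumption.

```python
from itertools import product


def flanked_variants(kmer, alphabet="ACGT"):
    """Every (left base + k-mer + right base) combination, in alphabet order."""
    return [left + kmer + right for left, right in product(alphabet, repeat=2)]


variants = flanked_variants("ACGT")
print(len(variants))
print(variants[:4])
```

The same function applied with a longer flank on each side would grow as 4 to the power of the total number of flanking positions, which is one reason degenerate flanking bases can be attractive.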

If not every k-mer is present in a collection, it is possible, for example, that the omitted k-mers have particular qualities, e.g., homo-polymeric sequences or sequences that include only two different types of nucleotides, and so forth. In certain embodiments, most k-mers are represented within the collection more than once. In preferred embodiments the sequence of a given k-mer is represented within a given molecule shared with other k-mers, and the sequence of this given k-mer is present on at least a second molecule wherein few or none of the other k-mers present on the given first molecule are present on the second molecule. While wishing not to be bound by theory, it is understood that multiple representations of a given k-mer can allow a more precise assignment of which k-mer of a molecule is being bound or otherwise subject to an interaction, and restricting, reducing, or eliminating occurrences of pairs of k-mers to single occurrences allows more precise determinations of which k-mer within a molecule is being bound or subject to an interaction.

In certain embodiments, to allow the binding site specificity of a DNA binding molecule to be determined completely and readily from a collection of DNA sequences, it is possible to design such DNA sequences with one or more of the following exemplary features: a) requiring all possible DNA binding sites within a given binding space (termed a ‘Hamming ball’, which refers to a particular k-mer and the k-mers that are within an arbitrary number of mismatches from it) to be located on separate molecules in the collection; b) requiring multiple copies of a given k-mer, with each copy flanked by unique flanking sequence so as to take into account potential junction effects; c) requiring that a given k-mer, if it is located at one end of one molecule, preferably be located either centrally within or at the other end of another molecule (e.g., to account for possible steric effects, surface attachment, synthesis, and/or the requirement for flanking DNA sequence); d) designing molecules of the collection such that k-mers within a given Hamming ball might be found on either strand (forward or reverse complement), if double-stranded molecules are used.
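The separation requirement for Hamming-ball members, together with the either-strand consideration for double-stranded molecules, can be checked programmatically. The following sketch flags any molecule that carries more than one member of a given ball; the function names, the `both_strands` switch, and the toy sequences are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence written with A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Set of distinct k-mers on the given strand."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def molecules_sharing_ball_members(molecules, ball, k, both_strands=True):
    """Return (index, members) for each molecule holding two or more ball members."""
    offenders = []
    for index, seq in enumerate(molecules):
        present = kmers(seq, k)
        if both_strands:
            # On a double-stranded molecule the reverse strand is also readable.
            present |= kmers(reverse_complement(seq), k)
        hits = present & set(ball)
        if len(hits) > 1:
            offenders.append((index, sorted(hits)))
    return offenders


ball = {"ACGT", "ACGA", "ACGC"}            # a center and two one-mismatch neighbors
design = ["TTACGTTT", "GGACGAGG", "ACGTTTCGT"]
print(molecules_sharing_ball_members(design, ball, k=4))
print(molecules_sharing_ball_members(design, ball, k=4, both_strands=False))
```

In this toy design the third molecule looks acceptable when only the forward strand is read, but its reverse complement carries a second ball member, which illustrates why strand handling matters for double-stranded collections.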

There are a number of possible ways to design such a set of DNA sequences. In one embodiment, we have calculated what we believe to be a maximally compact representation of all possible binding sites. For this approach, we addressed the requirements for generating one string that contains every k-mer exactly once, when considering words in a stacked fashion. We formalized the notions of 1) ‘discernability’, e.g., ensuring that for every ‘Hamming ball’, for each of its words there exists at least one spot in which that word occurs without another word from that Hamming ball, and 2) ‘recoverability’ of Hamming balls. These two concepts (discernability and recoverability) jointly ensure that one can figure out which particular k-mer on a bound DNA sequence is actually bound. We ran computational simulations that indicated that most words in most Hamming balls are discernable, and that a Hamming ball is generally recoverable. This approach enables extracting any possible DNA binding site from such collections of molecules. If the collection is implemented as an array, the array can distinguish between different DNA binding sites, including similar sites.
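A string that contains every k-mer exactly once can be built in several ways. One classical construction with that property is a de Bruijn sequence; the text does not state which construction was used, so the sketch below is an assumption chosen for illustration. It generates the sequence with the standard Lyndon-word recursion, linearizes it, cuts it into overlapping molecules, and applies the discernability criterion described above to a supplied Hamming ball. All function names are hypothetical.

```python
def de_bruijn(alphabet, k):
    """Cyclic de Bruijn sequence of order k (standard Lyndon-word recursion)."""
    n = len(alphabet)
    a = [0] * (n * k)
    out = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def split_into_molecules(linear, length, k):
    """Cut a linear string into molecules overlapping by k-1 so no k-mer is lost."""
    step = length - k + 1
    last_start = len(linear) - k
    return [linear[i:i + length] for i in range(0, last_start + 1, step)]


def non_discernable_words(molecules, ball, k):
    """Ball members that never occur on a molecule free of other ball members."""
    discernable = set()
    for seq in molecules:
        hits = {seq[i:i + k] for i in range(len(seq) - k + 1)} & set(ball)
        if len(hits) == 1:
            discernable |= hits
    return set(ball) - discernable


k = 3
cyclic = de_bruijn("ACGT", k)
linear = cyclic + cyclic[:k - 1]           # unroll the cycle into a linear string
molecules = split_into_molecules(linear, length=10, k=k)
print(len(cyclic), len(linear), len(molecules))
```

A single pass over such a string places each k-mer on exactly one molecule, so overlapping neighbors in the string often share a molecule; representing each k-mer on additional molecules with different neighbors is what gives the discernability check a chance to pass.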

A collection of nucleic acid molecules described herein has numerous uses. Among them is the formation of an array of oligonucleotides for studying the binding properties of a compound, e.g., a nucleic acid binding protein. Exemplary nucleic acid binding proteins include natural and designed nucleic acid binding proteins (e.g., zinc finger proteins). Such an array of DNA oligonucleotides can be contacted by a labeled DNA binding protein; and by analyzing which oligonucleotides are bound, the sequence of the preferred binding site and the relative strengths of binding to related sites can be determined. More than one protein can be bound to such an array simultaneously. Proteins that compete or cooperate, or binding site variants of the same protein, can be bound simultaneously to analyze the binding site differences. These methods frequently use fluorescently labeled protein and fluorescent microarray scanners. Another use for these arrays is the binding of one or more proteins and the interrogation of each oligonucleotide spot with a laser for the purpose of identifying proteins using mass spectrometry. This embodiment enables identification and relative quantification of each protein bound to a given oligonucleotide, e.g., without labeling the protein.

Such a design of nucleic acid sequences is not limited to double-stranded DNA. It is also not limited to DNA microarrays, as such a set of nucleic acid sequences can also be used in solution.

In one embodiment, it is possible to use a highly parallel in vitro microarray technology for high-throughput characterization of the sequence specificities of DNA-protein interactions. We shall refer to this approach as protein binding microarray technology, or simply “PBM.” PBM technology allows one to determine the binding site specificities of known, designed, or predicted transcription factors (TFs) in a useful time frame, for example, in a single day. The method can include providing a purified or at least partially purified TF.

Such PBM experiments may be particularly useful when ChIP-chip experiments for particular TFs do not result in enough enrichment of bound fragments in the immunoprecipitated sample to permit identification of the DNA sites bound in vivo. The PBM data may also provide valuable data for those TFs for which it is not known under what culture conditions they are expressed or at what timepoints they are expressed. Moreover, there are hundreds of predicted DNA binding proteins that could be screened for sequence-specific binding using the rapid, high-throughput PBM experiments.

The advantages include the provision of numerous DNA sequence variants in a space- and cost-efficient manner. Only a minimal number of individual DNA sequences and individual DNA spots need to be synthesized. This also has implications, for example, on the required density and number of the spots on synthesized microarrays.

In another aspect, the disclosure features a collection of nucleic acids that includes all or a certain fraction (e.g., at least 70, 80, 90, 95, 98, 99% of all) intergenic sequences from a chromosome of an organism, or from a genome of an organism. Related collections can include all or a certain fraction (e.g., at least 70, 80, 90, 95, 98, 99% of all) sequences that are within a certain distance of identifiable RNA start sites, TATA boxes, Hogness boxes, Pribnow boxes, or Shine-Dalgarno sequences. For example, the sequences can be within 100, 200, 500, 800, 1000, or 5000 nucleotides of such sites. The nucleic acids of the collections can be of any length, e.g., an amplifiable length, or a length ranging from 60 basepairs to 1500 basepairs or 30 bp to 2400 bp, or 30 bp to 200 bp. The nucleic acids can be immobilized, e.g., on one or more arrays. In some embodiments, the collection also includes nucleic acids that are intragenic, e.g., coding sequences, introns, or untranslated regulatory sequences.

In an exemplary method, the nucleic acids of the collection are contacted to an agent, e.g., a protein, small molecule, etc., and interactions between the agent and one or more of the nucleic acids are evaluated. For example, the protein can be a transcription factor or a nucleic acid binding fragment thereof.

All references, patents, and patent applications are hereby incorporated by reference in their entireties. The following references are among those so incorporated: US 2003-0215856, US 2003-0108880, U.S. Ser. Nos. 60/227,900, 60/564,864, 60/587,066 (inclusive of all Appendices and Figures) and PCT/US01/26435.

DETAILED DESCRIPTION

A collection of polymers which include all or a certain percentage of all k-mers has a variety of uses. For example, the polymers can be nucleic acids or polypeptides (e.g., short peptides of length less than 50, 40, 30, 20, or 10 amino acids). The collection can be used to characterize interactions between an agent or a pool that includes multiple agents and members of the collection. The results of such characterization can indicate which k-mers interact with the agent or the pool. Although such collections of polymers have numerous applications, the following are some exemplary ones.

In one embodiment, the DNA binding specificity of a test compound, e.g., a test macromolecule such as a protein or a fragment thereof, can be characterized. For example, the DNA binding domain of a transcription factor can be contacted to members of the collection. The DNA binding domain can be labeled and the members of the collection can be disposed on an array. After contacting the domain to the array and washing the array, the array can be imaged to identify which members of the collection are bound by the domain. The method can be adapted, e.g., to identify one or more functional fragments that interact with DNA.

Results obtained with the candidate fragments can be compared to those of a larger fragment or the entire protein. Thus, information is accrued about the relative contributions of different regions of the protein to binding specificity and affinity. A similar method can be used for DNA binding domains from other types of nuclear proteins, e.g., centromeric proteins, telomeric proteins, and so forth. Protein complexes, including homo- and hetero-oligomers, can be analyzed. Likewise, RNA binding proteins or RNA binding domains thereof can be characterized using an embodiment in which the members of the collection are RNA.

The collection of polymers can be used to design a nucleic acid binding protein with a desired DNA binding specificity. A known nucleic acid binding domain or scaffold (e.g., zinc finger domains) can be randomized or specifically mutated, e.g., at DNA contacting positions. DNA contacting positions can be identified by inspection of three-dimensional structures, e.g., obtained by X-ray crystallography or NMR. These positions can be randomized, e.g., to all possible amino acids, all nineteen non-cysteine amino acids, hydrophilic amino acids, or a combination of hydrophilic and aliphatic amino acids. Mutated domains are contacted to the collection of polymers, and variants which have a desired binding specificity can be used for engineering the nucleic acid binding protein. For example, variants that interact with a site present in a target gene can be selected and used as the DNA binding domain of an artificial transcription factor that regulates that target gene. The artificial transcription factor can also include a transcriptional regulatory domain, e.g., an activation or repression domain. DNA binding domains can be constructed by linking different modules, each with a desired binding specificity, in order to produce a chimeric protein that recognizes a site. For example, one can identify or design a first polypeptide that interacts with a first k-mer and a second polypeptide that interacts with a second k-mer. Then the first and second polypeptide can be linked (e.g., by a covalent or non-covalent linkage, e.g., by making a chimeric polypeptide that includes both the first and second polypeptide) to provide a protein that recognizes a site that includes the first and second k-mers. The first and second k-mers can be directly adjacent or gapped, e.g., by at least 1, 2, 3, 4, or 5 nucleotides, e.g., by one or more turns of the DNA helix.

Similar analyses can be used to characterize non-proteinaceous agents. For example, members of a chemical library or designed chemicals (e.g., polyamides and intercalators) can be evaluated, e.g., for binding specificity and affinity.

The collection of polymers can be used in implementations that do not feature an array. For example, it is possible to prepare the collection so that each polymer includes one or more invariant termini that serve as primer binding sites. The members of the collection are contained in solution, e.g., in a single container. A protein of interest can be contacted to the solution and immobilized on a solid support. After the contacting, the support can be washed to remove unbound members of the collection. The bound members can be characterized, e.g., by PCR amplification with primers that recognize the invariant termini. Amplification products can be sequenced to determine the identity of the members that bound to the protein of interest.

In addition to binding specificity, it is possible to characterize other types of interactions, e.g., enzymatic interactions. For example, an enzyme that specifically interacts with DNA can be evaluated using the collection of polymers. In one implementation, a site-specific nuclease is contacted to the collection. Members of the collection that are cleaved by the nuclease are identified. For example, if the members are present on an array and are end-labeled, the cleaved members are identified by release of the label from the respective address of the array. In other types of assays, the DNA or other nucleic acid can be modified, e.g., with a labeled molecule. Other types of enzymatic reactions that can be evaluated include methylation, acetylation (e.g., of histone bound polymers), polymerization, deamination, and so forth.

Similar applications are also available where the polymers include peptide nucleic acids (PNAs), peptides (e.g., polypeptides or short peptides), or artificial polymers (e.g., peptoids). See, e.g., Simon et al. (1992) Proc. Natl. Acad. Sci. USA 89:9367-71 and Horwell (1995) Trends Biotechnol. 13:132-4. For example, a collection of such polymers can be used to evaluate kinase specificity, e.g., by identifying which members of the collection are phosphorylated by a particular kinase. The collection can also be modified so that, rather than include all or a certain percentage of all k-mers, it includes all or a certain percentage of all k-mers with a certain property, e.g., all k-mers that include at least one serine, or all k-mers that include at least one tyrosine, or all k-mers that include at least one residue that can be phosphorylated (e.g., histidine, serine, threonine, or tyrosine). Of course, collections of polymers other than nucleic acids can also be used to evaluate binding interactions and other types of interactions.
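Restricting a peptide collection to k-mers with a given property can be illustrated by enumeration. The sketch below lists the peptide k-mers that contain at least one of the phosphorylatable residues named above (H, S, T, Y in one-letter code) and checks the count against the closed form 20^k − 16^k. The twenty-letter standard amino acid alphabet and the function names are assumptions made for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, one-letter code
PHOSPHORYLATABLE = frozenset("HSTY")   # histidine, serine, threonine, tyrosine


def kmers_with_property(k, required=PHOSPHORYLATABLE, alphabet=AMINO_ACIDS):
    """Yield every k-mer over `alphabet` containing at least one required residue."""
    for word in product(alphabet, repeat=k):
        if required.intersection(word):
            yield "".join(word)


def count_with_property(k, required=PHOSPHORYLATABLE, alphabet=AMINO_ACIDS):
    """Closed-form count: all k-mers minus those built only from other residues."""
    excluded = len(set(alphabet) - set(required))
    return len(alphabet) ** k - excluded ** k


enumerated = sum(1 for _ in kmers_with_property(2))
print(enumerated, count_with_property(2))
print(count_with_property(3))
```

Enumeration is only practical for small k; for larger k the closed form gives the size of the restricted collection without listing it.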

The collection of polymers can be used to identify molecules with any desired activity, e.g., a binding or other functional activity.

Polymer Arrays

A collection of polymers described herein can be immobilized on an array. An array is a substrate that includes a plurality of addresses. Each address can include a homogeneous population of immobilized nucleic acids, e.g., nucleic acids of predetermined sequence. The density of addresses can be at least 10, 50, 200, 500, 10³, 10⁴, 10⁵, or 10⁶ addresses per cm², and/or no more than 10, 50, 100, 200, 500, 10³, 10⁴, 10⁵, or 10⁶ addresses/cm². Addresses in addition to addresses of the plurality can be deposited on the array. The addresses can be distributed on the substrate in one dimension, e.g., a linear array; in two dimensions, e.g., a planar array; or in three dimensions, e.g., a three-dimensional array (e.g., layers of a gel matrix). The term “microarray” is used interchangeably with the term “array.” The term “array” also refers to any coatings or surfaces on a substrate such that a molecule that is said to be “on” or physically associated with an array may be physically associated with such a coating or surface.

In one embodiment, the substrate is an insoluble or solid substrate. Potentially useful insoluble substrates include: mass spectroscopy plates (e.g., for MALDI), glass (e.g., functionalized glass, a glass slide, porous silicate glass, a single crystal silicon, quartz, UV-transparent quartz glass), plastics and polymers (e.g., polystyrene, polypropylene, polyvinylidene difluoride, poly-tetrafluoroethylene, polycarbonate, PDMS, acrylic), metal coated substrates (e.g., gold), silicon substrates, latex, and membranes (e.g., nitrocellulose, nylon). The insoluble substrate can also be pliable. The substrate can be opaque, translucent, or transparent. In some embodiments, the array is merely fashioned from a multiwell plate, e.g., a 96- or 384-well microtitre plate.

A variety of methods can be used to prepare an array. In some embodiments, polymers are synthesized in situ, e.g., on the array. In other embodiments, the polymers are synthesized and then disposed on the array. The polymers are synthesized according to one of the sequence design methods described herein.

1. Light-Directed Methods

Where a single solid support is employed, the oligonucleotides can be formed using a variety of techniques known to those skilled in the art of polymer synthesis on solid supports. For example, light-directed methods are described in U.S. Pat. Nos. 5,143,854, 5,510,270, and 5,527,681. These methods involve activating predefined regions of a solid support and then contacting the support with a preselected monomer solution. These regions can be activated with a light source, typically shone through a mask (much in the manner of photolithography techniques used in integrated circuit fabrication). Other regions of the support remain inactive because illumination is blocked by the mask and they remain chemically protected. Thus, a light pattern defines which regions of the support react with a given monomer. By repeatedly activating different sets of predefined regions and contacting different monomer solutions with the support, a diverse array of polymers is produced on the support. Other steps, such as washing unreacted monomer solution from the support, can be used as necessary. Other applicable methods include mechanical techniques such as those described in PCT No. 92/10183 and U.S. Pat. No. 5,384,261. Still further techniques include bead based techniques such as those described in PCT US/93/04145 and pin based methods such as those described in U.S. Pat. No. 5,288,514.

The surface of a solid support, optionally modified with spacers having photolabile protecting groups such as NVOC and MeNPOC, is illuminated through a photolithographic mask, yielding reactive groups (typically hydroxyl groups) in the illuminated regions. A 3′-O-phosphoramidite activated deoxynucleoside (protected at the 5′-hydroxyl with a photolabile protecting group) is then presented to the surface and chemical coupling occurs at sites that were exposed to light. Following capping and oxidation, the support is rinsed and the surface illuminated through a second mask, to expose additional hydroxyl groups for coupling. A second 5′-protected, 3′-O-phosphoramidite activated deoxynucleoside is presented to the surface. The selective photodeprotection and coupling cycles are repeated until the desired set of oligonucleotides is produced. Alternatively, an oligomer of from, for example, 4 to 30 nucleotides can be added to each of the preselected regions rather than synthesize each member one nucleotide monomer at a time.

2. Flow Channel or Spotting Methods

Additional methods applicable to array synthesis on a single support are described in U.S. Pat. No. 5,384,261. In the methods disclosed in these applications, reagents are delivered to the support by either (1) flowing within a channel defined on predefined regions or (2) “spotting” on predefined regions. Other approaches, as well as combinations of spotting and flowing, may be employed as well. In each instance, certain activated regions of the support are mechanically separated from other regions when the monomer solutions are delivered to the various reaction sites.

A typical “flow channel” method can generally be described as follows: Diverse polymer sequences are synthesized at selected regions of a solid support by forming flow channels on a surface of the support through which appropriate reagents flow or in which appropriate reagents are placed. For example, assume a monomer “A” is to be bound to the support in a first group of selected regions. If necessary, all or part of the surface of the support in all or a part of the selected regions is activated for binding by, for example, flowing appropriate reagents through all or some of the channels, or by washing the entire support with appropriate reagents. After placement of a channel block on the surface of the support, a reagent having the monomer A flows through or is placed in all or some of the channel(s). The channels provide fluid contact to the first selected regions, thereby binding the monomer A to the support directly or indirectly (via a spacer) in the first selected regions.

Thereafter, a monomer B is coupled to second selected regions, some of which may be included among the first selected regions. The second selected regions will be in fluid contact with a second flow channel(s) through translation, rotation, or replacement of the channel block on the surface of the support; through opening or closing a selected valve; or through deposition of a layer of chemical or photoresist. If necessary, a step is performed for activating at least the second regions. Thereafter, the monomer B is flowed through or placed in the second flow channel(s), binding monomer B at the second selected locations. In this particular example, the resulting sequences bound to the support at this stage of processing will be, for example, A, B, and AB. The process is repeated to form a vast array of sequences of desired length at known locations on the support.
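A toy model can show how the sequences A, B, and AB arise. The sketch below assumes a hypothetical 2×2 support, with the channel block first aligned along a row and then rotated to lie along a column. The helper `flow` and the grid layout are illustrative assumptions, not part of the referenced method.

```python
def flow(grid, monomer, rows=(), cols=()):
    """Couple monomer to every cell lying in one of the given channels.

    rows and cols list the row and column channels that are in fluid
    contact with the reagent during this step.
    """
    for r in range(len(grid)):
        for c in range(len(grid[r])):
            if r in rows or c in cols:
                grid[r][c] += monomer
    return grid


grid = [["", ""], ["", ""]]
flow(grid, "A", rows={0})  # first selected regions: channel along row 0
flow(grid, "B", cols={0})  # block rotated: channel along column 0
print(grid)  # [['AB', 'A'], ['B', '']]
```

The cell where the two channels cross carries AB, the cells touched by only one channel carry A or B, and the untouched cell stays empty.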

After the support is activated, monomer A can be flowed through some of the channels, monomer B can be flowed through other channels, a monomer C can be flowed through still other channels, etc. In this manner, many or all of the reaction regions are reacted with a monomer before the channel block must be moved or the support must be washed and/or reactivated. By making use of many or all of the available reaction regions simultaneously, the number of washing and activation steps can be minimized. One of skill in the art will recognize that there are alternative methods of forming channels or otherwise protecting a portion of the surface of the support. For example, a protective coating such as a hydrophilic or hydrophobic coating (depending upon the nature of the solvent) is utilized over portions of the support to be protected, sometimes in combination with materials that facilitate wetting by the reactant solution in other regions. In this manner, the flowing solutions are further prevented from passing outside of their designated flow paths.

The “spotting” methods of preparing compounds and arrays can be implemented in much the same manner. A first monomer, A, can be delivered to and coupled with a first group of reaction regions which have been appropriately activated. Thereafter, a second monomer, B, can be delivered to and reacted with a second group of activated reaction regions. Unlike the flow channel embodiments described above, reactants are delivered in relatively small quantities by directly depositing them in selected regions. In some steps, the entire support surface can be sprayed or otherwise coated with a solution, if it is more efficient to do so. Precisely measured aliquots of monomer solutions may be deposited dropwise by a dispenser that moves from region to region. Typical dispensers include a micropipette to deliver the monomer solution to the support and a robotic system to control the position of the micropipette with respect to the support, or an ink-jet printer. In other embodiments, the dispenser includes a series of tubes, a manifold, an array of pipettes, or the like so that various reagents can be delivered to the reaction regions simultaneously.

3. Pin-Based Methods

Another method which is useful for the preparation of the immobilized arrays involves “pin-based synthesis.” This method, which is described in detail in U.S. Pat. No. 5,288,514, utilizes a support having a plurality of pins or other extensions. The pins are each inserted simultaneously into individual reagent containers in a tray. An array of 96 pins is commonly utilized with a 96-container tray, such as a 96-well microtitre dish. Each tray is filled with a particular reagent for coupling in a particular chemical reaction on an individual pin. Accordingly, the trays can often contain different reagents. Since the chemical reactions have been optimized such that each of the reactions can be performed under a relatively similar set of reaction conditions, it becomes possible to conduct multiple chemical coupling steps simultaneously. The method can use a support with a spacer, S, having active sites. In the particular case of oligonucleotides, for example, the spacer may be selected from a wide variety of molecules which can be used in organic environments associated with synthesis as well as aqueous environments associated with binding studies such as may be conducted between the nucleic acid members of the array and other molecules. These molecules include, but are not limited to, proteins (or fragments thereof), lipids, carbohydrates, proteoglycans and nucleic acid molecules. Examples of suitable spacers are polyethyleneglycols, dicarboxylic acids, polyamines and alkylenes, substituted with, for example, methoxy and ethoxy groups. Additionally, the spacers will have an active site on the distal end. The active sites are optionally protected initially by protecting groups. Among a wide variety of protecting groups which are useful are FMOC, BOC, t-butyl esters, t-butyl ethers, and the like.

Various exemplary protecting groups are described in, for example, Atherton et al., 1989, Solid Phase Peptide Synthesis, IRL Press. In some embodiments, the spacer may provide for a cleavable function by way of, for example, exposure to acid or base.

b. Arrays on Multiple Supports

Yet another method which is useful for synthesis of compounds and arrays includes “bead based synthesis.” A general approach for bead based synthesis is described in PCT/US93/04145 (filed Apr. 28, 1993).

c. Protein and Peptide Arrays

US 2002-0192673 describes exemplary methods for making a protein array which include protein translation. U.S. Pat. No. 5,143,854 describes methods that include synthetic amino acid coupling.

Nucleic Acid Binding Proteins

A collection of nucleic acids, as described herein, can be used to evaluate the binding properties of nucleic acid binding proteins (NABPs).

A variety of protein structures are known to bind nucleic acids with high affinity and high specificity. These structures are used in a large number of different proteins, including proteins that specifically control nucleic acid function. For reviews of structural motifs which recognize double stranded DNA, see, e.g., Pabo and Sauer (1992) Annu. Rev. Biochem. 61:1053-95; Patikoglou and Burley (1997) Annu. Rev. Biophys. Biomol. Struct. 26:289-325; Nelson (1995) Curr Opin Genet Dev. 5:180-9. A few non-limiting examples of nucleic acid binding domains include:

Zinc fingers. Zinc fingers are small polypeptide domains of approximately 30 amino acid residues in which there are four amino acids, either cysteine or histidine, appropriately spaced such that they can coordinate a zinc ion (for reviews, see, e.g., Klug and Rhodes, (1987) Trends Biochem. Sci. 12:464-469; Evans and Hollenberg, (1988) Cell 52:1-3; Payre and Vincent, (1988) FEBS Lett. 234:245-250; Miller et al., (1985) EMBO J. 4:1609-1614; Berg, (1988) Proc. Natl. Acad. Sci. U.S.A. 85:99-102; Rosenfeld and Margalit, (1993) J. Biomol. Struct. Dyn. 11:557-570). Hence, zinc finger domains can be categorized according to the identity of the residues that coordinate the zinc ion, e.g., as the Cys₂-His₂ class, the Cys₂-Cys₂ class, the Cys₂-CysHis class, and so forth. The zinc coordinating residues of Cys₂-His₂ zinc fingers have a typical spacing, which is described in Wolfe et al., (1999) Annu. Rev. Biophys. Biomol. Struct. 3:183-212. Typically, the intervening amino acids fold to form an anti-parallel β-sheet that packs against an α-helix, although the anti-parallel β-sheets can be short, non-ideal, or non-existent. The fold positions the zinc-coordinating side chains so they are in a tetrahedral conformation appropriate for coordinating the zinc ion. The base contacting residues are at the N-terminus of the finger and in the preceding loop region. A zinc finger DNA-binding protein normally consists of a tandem array of three or more zinc finger domains.

The zinc finger domain (or “ZFD”) is one of the most common eukaryotic DNA-binding motifs, found in species from yeast to higher plants and to humans. By one estimate, there are at least several thousand zinc finger domains in the human genome alone, possibly at least 4,500. Zinc finger domains can be isolated from zinc finger proteins. Non-limiting examples of zinc finger proteins include CF2-II, Kruppel, WT1, basonuclin, BCL-6/LAZ-3, erythroid Kruppel-like transcription factor, transcription factors Sp1, Sp2, Sp3, and Sp4, transcriptional repressor YY1, EGR1/Krox24, EGR2/Krox20, EGR3/Pilot, EGR4/AT133, Evi-1, GLI1, GLI2, GLI3, HIV-EP1/ZNF40, HIV-EP2, KRI, ZfX, ZfY, and ZNF7.

Computational methods described below can be used to identify all zinc finger domains encoded in a sequenced genome or in a nucleic acid database. Any such zinc finger domain can be utilized. In addition, artificial zinc finger domains have been designed, e.g., using computational methods (e.g., Dahiyat and Mayo, (1997) Science 278:82-7).

Homeodomains. Homeodomains are simple eukaryotic domains that consist of an N-terminal arm that contacts the DNA minor groove, followed by three α-helices that contact the major groove (for a review, see, e.g., Laughon, (1991) Biochemistry 30:11357-67). The third α-helix is positioned in the major groove and contains critical DNA-contacting side chains. Homeodomains have a characteristic highly-conserved motif present at the turn leading into the third α-helix. The motif includes an invariant tryptophan that packs into the hydrophobic core of the domain. Homeodomains are commonly found in transcription factors that determine cell identity and provide positional information during organismal development. Such classical homeodomains can be found in the genome in clusters such that the order of the homeodomains in the cluster approximately corresponds to their expression pattern along a body axis. Homeodomains can be identified by alignment with a homeodomain, e.g., Hox-1, or by alignment with a homeodomain profile or a homeodomain hidden Markov Model (HMM; see below), e.g., PF00046 of the Pfam database or “HOX” of the SMART database (see online resources referenced in Letunic et al. (2004) Nucleic Acids Res. 32(Database issue):D142-4), or by the Prosite motif PDOC00027 as mentioned above.

Helix-turn-helix proteins. This DNA binding motif is common among many prokaryotic transcription factors. There are many subfamilies, e.g., the LacI family and the AraC family, to name but a few. The two helices in the name refer to a first α-helix that packs against and positions a second α-helix in the major groove of DNA. These domains can be identified by alignment with a HMM, e.g., HTH_ARAC, HTH_ARSR, HTH_ASNC, HTH_CRP, HTH_DEOR, HTH_DTXR, HTH_GNTR, HTH_ICLR, HTH_LACI, HTH_LUXR, HTH_MARR, HTH_MERR, and HTH_XRE profiles available in the SMART database.

Helix-loop-helix proteins. This DNA binding domain is commonly found among homo- and hetero-dimeric transcription factors, e.g., MyoD, fos, jun, E11, and myogenin. The domain consists of a dimer, each monomer contributing two α-helices and an intervening loop. The domain can be identified by alignment with a HMM, e.g., the “HLH” profile available in the SMART database. Although helix-loop-helix proteins are typically dimeric, monomeric versions can be constructed by engineering a polypeptide linker between the two subunits such that a single open reading frame encodes both subunits and the linker.

Some nucleic acid binding domains can bind to both DNA and RNA. For example, certain zinc finger domains and homeodomains can interact with RNA as well as DNA. In addition, a number of RNA binding domains, both natural and artificial, are known. For example, the HIV tat protein includes an RNA binding domain that is arginine rich. See generally, Tian et al. (2003) Prog Nucleic Acid Res Mol Biol. 74:123-58; Das et al. (2003) Biopolymers. 70(1):80-5; and Doyle et al. (2002) J Struct Biol. 140(1-3):147-53.

Identification of Protein Domains

A variety of methods can be used to identify structural domains, e.g., nucleic acid binding domains.

Computational Methods. The amino acid sequence of a DNA binding domain isolated by a method described herein can be compared to a database of known sequences, e.g., an annotated database of protein sequences or an annotated database which includes entries for nucleic acid binding domains. In another implementation, databases of uncharacterized sequences, e.g., unannotated genomic, EST or full-length cDNA sequence; of characterized sequences, e.g., SwissProt or PDB; and of domains, e.g., Pfam, ProDom (Servant et al. (2002) Brief Bioinform. 3(3):246-51), and SMART (Simple Modular Architecture Research Tool, see above) can provide a source of nucleic acid binding domain sequences. Nucleic acid sequence databases can be translated in all six reading frames for the purpose of comparison to a query amino acid sequence. Nucleic acid sequences that are flagged as encoding candidate nucleic acid binding domains can be amplified from an appropriate nucleic acid source, e.g., genomic DNA or cellular RNA. Such nucleic acid sequences can be cloned into an expression vector. The procedures for computer-based domain identification can be interfaced with an oligonucleotide synthesizer and robotic systems to produce nucleic acids encoding the domains in a high-throughput platform. Cloned nucleic acids encoding the candidate domains can also be stored in a host expression vector and shuttled easily into an expression vector, e.g., into a translational fusion vector with Zif268 fingers 1 and 2, either by restriction enzyme mediated subcloning or by site-specific, recombinase mediated subcloning (see U.S. Pat. No. 5,888,732). The high-throughput platform can be used to generate multiple microtitre plates containing nucleic acids encoding different candidate nucleic acid binding domains.
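As an illustration of translating a nucleic acid sequence in all six reading frames, the following minimal sketch uses the standard genetic code and plain Python. The function names, the frame labels (“+1” to “-3”), and the short example sequence are assumptions made for illustration; ambiguous bases and alternative genetic codes are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; '*' marks a stop."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to their translations."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frames("ATGGCCTGA")
print(frames["+1"])  # MA*
print(frames["-1"])  # SGH
```

Each of the six translations can then be compared against a query amino acid sequence or profile.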

Detailed methods for the identification of domains from a starting sequence or a profile are well known. See, for example, Prosite (Hofmann et al., (1999) Nucleic Acids Res. 27:215-219), FASTA, BLAST (Altschul et al., (1990) J. Mol. Biol. 215:403-10), etc. A simple string search can be done to find amino acid sequences with identity to a query sequence or a query profile, e.g., using Perl to scan text files. Sequences so identified can be about 30%, 40%, 50%, 60%, 70%, 80%, 90%, or greater identical to an initial input sequence.
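A simple string search of this kind can be sketched as an ungapped sliding-window comparison. The example below is written in Python rather than Perl, and the function names, the example sequences, and the choice to ignore gaps are illustrative assumptions, not a description of any particular search tool.

```python
def percent_identity(a, b):
    """Percent of aligned positions at which two equal-length strings match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def scan(query, target, min_identity=30.0):
    """Slide query along target; report windows at or above min_identity.

    Returns a list of (start, window, percent_identity) tuples.
    """
    hits = []
    width = len(query)
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        pid = percent_identity(query, window)
        if pid >= min_identity:
            hits.append((start, window, pid))
    return hits


# Hypothetical query and target amino acid sequences.
hits = scan("CPECGK", "MKYVCPECGRSF", min_identity=80.0)
for start, window, pid in hits:
    print(start, window, round(pid, 1))  # 4 CPECGR 83.3
```

Lowering `min_identity` toward 30% widens the search, at the cost of more spurious hits.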

Domains similar to a query domain can be identified from a public database, e.g., using the XBLAST programs (version 2.0) of Altschul et al., (1990) J. Mol. Biol. 215:403-10. For example, BLAST protein searches can be performed with the XBLAST parameters as follows: score=50, wordlength=3. Gaps can be introduced into the query or searched sequence as described in Altschul et al., (1997) Nucleic Acids Res. 25(17):3389-3402. Default parameters for XBLAST and Gapped BLAST programs are available from on-line resources of the National Center of Biotechnology Information, National Institutes of Health, Bethesda Md.

The Prosite profiles PS00028 and PS50157 can be used to identify zinc finger domains. In a SWISSPROT release of 80,000 protein sequences, these profiles detected 3189 and 2316 zinc finger domains, respectively. Profiles can be constructed from a multiple sequence alignment of related proteins by a variety of different techniques. Gribskov and co-workers (Gribskov et al., (1990) Meth. Enzymol. 183:146-159) utilized a symbol comparison table to convert a multiple sequence alignment supplied with residue frequency distributions into weights for each position. See, for example, the PROSITE database and the work of Luethy et al., (1994) Protein Sci. 3:139-146.

Hidden Markov Models (HMMs) representing a DNA binding domain of interest can be generated or obtained from a database of such models, e.g., the Pfam database, release 2.1. A database can be searched, e.g., using the default parameters, with the HMM in order to find additional domains (see, e.g., the Sanger Center, Cambridge UK for default parameters). Alternatively, the user can optimize the parameters. A threshold score can be selected to filter the database of sequences such that sequences that score above the threshold are displayed as candidate domains. A description of the Pfam database can be found in Sonnhammer et al., (1997) Proteins 28(3):405-420, and a detailed description of HMMs can be found, for example, in Gribskov et al., (1990) Meth. Enzymol. 183:146-159; Gribskov et al., (1987) Proc. Natl. Acad. Sci. USA 84:4355-4358; Krogh et al., (1994) J. Mol. Biol. 235:1501-1531; and Stultz et al., (1993) Protein Sci. 2:305-314.

The SMART database of HMMs (Simple Modular Architecture Research Tool, available from online resources of EMBL, Heidelberg, Germany; Schultz et al., (1998) Proc. Natl. Acad. Sci. USA 95:5857 and Schultz et al., (2000) Nucl. Acids Res 28:231) provides a catalog of zinc finger domains (ZnF_C2H2; ZnF_C2C2; ZnF_C2HC; ZnF_C3H1; ZnF_C4; ZNF_CHCC; ZnF_GATA; and ZnF_NFX) identified by profiling with the hidden Markov models of the HMMer2 search program (Durbin et al., (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press).

Hybridization-based Methods. A collection of nucleic acids encoding various forms of a DNA binding domain can be analyzed to profile sequences encoding conserved amino- and carboxy-terminal boundary sequences. Degenerate oligonucleotides can be designed to hybridize to sequences encoding such conserved boundary sequences. Moreover, the efficacy of such degenerate oligonucleotides can be estimated by comparing their composition to the frequency of possible annealing sites in known genomic sequences. Multiple rounds of design can be used to optimize the degenerate oligonucleotides.

A library of nucleic acid domains can be constructed by isolation of nucleic acid sequences encoding domains from genomic DNA or cDNA of eukaryotic organisms such as humans. Multiple methods are available for doing this. For example, a computer search of available amino acid sequences can be used to identify the domains, as described above. A nucleic acid encoding each domain can be isolated and inserted into a vector appropriate for the expression in cells, e.g., a vector containing a promoter, an activation domain, and a selectable marker. In another example, degenerate oligonucleotides that hybridize to a conserved motif are used to amplify, e.g., by PCR, a large number of related domains containing the motif. Moreover, screening a collection limited to domains of interest, unlike screening a library of unselected genomic or cDNA sequences, significantly decreases library complexity and reduces the likelihood of missing a desirable sequence due to the inherent difficulty of completely screening large libraries. Domains in such libraries can be characterized, e.g., using the methods described herein.

Engineering Nucleic Acid Binding Specificity

It is possible to use the collection of polymers described herein to characterize, screen and select modified proteins. For example, known nucleic acid binding proteins can be mutated, e.g., by randomizing one or more positions or by making site specific mutations (e.g., a substitution, insertion, and/or deletion). Such modified proteins can then be contacted to a collection of polymers, either individually or in groups.

Any suitable method known in the art can be used to modify, design and construct nucleic acids encoding NABPs, e.g., phage display, random mutagenesis, combinatorial libraries, computer/rational design, affinity selection, PCR, cloning from cDNA or genomic libraries, synthetic construction and the like. Examples of site specific mutations include modifying a DNA contacting residue, e.g., to alanine, or to a different hydrophilic amino acid. For example, some modifications change the size of the side chain. Examples of randomizations include completely random representation of amino acids (e.g., all amino acids of a set can occur with equal probability, or a probability that is a function of their codon usage and the set of allowed codons), or partially random, e.g., by biasing nucleotides or codons, e.g., to favor certain amino acids (e.g., wildtype amino acids) relative to others. Randomization can occur at one or more positions. For example, at least 25, 50, or 75% of the DNA contacting positions can be randomized, or all such positions can be randomized.
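To see how amino acid probabilities follow from the set of allowed codons, consider a degenerate codon such as NNK (K = G or T), a common choice for randomized libraries that is used here only as an assumed example. The sketch below enumerates the allowed codons under the standard genetic code and assumes an equimolar nucleotide mix at each position; the function name and the small IUPAC subset are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Subset of IUPAC nucleotide codes; extend as needed.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "K": "GT", "S": "CG"}


def amino_acid_distribution(degenerate_codon):
    """Probability of each amino acid (or '*' stop) for a degenerate codon,
    assuming every allowed nucleotide is equally likely at each position."""
    pools = [IUPAC[base] for base in degenerate_codon]
    codons = ["".join(c) for c in product(*pools)]
    counts = Counter(CODON_TABLE[c] for c in codons)
    return {aa: Fraction(n, len(codons)) for aa, n in counts.items()}


dist = amino_acid_distribution("NNK")
print(dist["L"])  # 3/32
print(dist["M"])  # 1/32
print(dist["*"])  # 1/32
```

Under these assumptions the 32 NNK codons cover all 20 amino acids plus one stop codon, but not with equal probability: leucine is three times as likely as methionine.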

Randomization can be used to produce a library, e.g., a phage display library, from which useful variants can be identified. The library can be screened, e.g., using a desired target that is immobilized. The library as a whole or a subset of the library can be contacted to the collection of polymers, e.g., to provide information about the library's collective ability to interact with different k-mers.

One example of a nucleic acid binding protein that can be altered to produce an artificial transcription factor is a protein that includes one or more zinc finger domains. Such domains are typically arranged as an array of at least three fingers. The nucleic acid binding region of such proteins can be prepared by selection in vitro (e.g., using phage display) or in vivo, or by design based on a recognition code (see, e.g., WO 00/42219 and U.S. Pat. No. 6,511,808). See, e.g., Rebar et al. (1996) Methods Enzymol 267:129; Greisman and Pabo (1997) Science 275:657; Isalan et al. (2001) Nat. Biotechnol 19:656; and Wu et al. (1995) Proc. Nat. Acad. Sci. USA 92:344 for, among other things, methods for creating libraries of varied zinc finger domains. See also, e.g., U.S. Pat. Nos. 5,789,538; 6,453,242; U.S. 2003-0165997; Wu et al., PNAS 92:344-348 (1995); Jamieson et al., Biochemistry 33:5689-5695 (1994); Rebar & Pabo, Science 263:671-673 (1994); Choo & Klug, PNAS 91:11163-11167 (1994); Choo & Klug, PNAS 91:11168-11172 (1994); Desjarlais & Berg, PNAS 90:2256-2260 (1993); Desjarlais & Berg, PNAS 89:7345-7349 (1992); Pomerantz et al., Science 267:93-96 (1995); Pomerantz et al., PNAS 92:9752-9756 (1995); Liu et al., PNAS 94:5525-5530 (1997); Greisman & Pabo, Science 275:657-661 (1997); and Desjarlais & Berg, PNAS 91:11099-11103 (1994).

Non-Proteinaceous Nucleic Acid Binding Compounds

A collection of polymers described herein can also be used to evaluate, design, and test non-proteinaceous nucleic acid binding compounds and other agents which interact with nucleic acids. The polymers can be used, e.g., to evaluate the degree of specificity of an agent for particular sequences. Specificity may vary across a continuum from completely non-specific to broad to highly sequence specific. Again, interactions other than binding interactions can also be evaluated.

One class of compounds which can specifically interact with nucleic acid is the class of polyamides, which includes pyrrole-imidazole polyamides. See, e.g., generally, Dervan et al. (2003) “Recognition of the DNA minor groove by pyrrole-imidazole polyamides.” Curr Opin Struct Biol. 13(3):284-99. See also, e.g., U.S. Ser. No. 08/607,078, PCT/US97/03332, U.S. Ser. Nos. 08/837,524, 08/853,525, PCT/US97/12733, U.S. Ser. No. 08/853,522, PCT/US97/12722, PCT/US98/06997, PCT/US98/02444, PCT/US98/02684, PCT/US98/01006, PCT/US98/03829, and PCT/US98/0714. As described in the foregoing references, polyamides comprise polymers of amino acids covalently linked by amide bonds. Preferably, the amino acids used to form these polymers include N-methylpyrrole (Py) and N-methylimidazole (Im). Polyamides containing pyrrole (Py) and imidazole (Im) amino acids are synthetic ligands that have an affinity and specificity for DNA comparable to naturally occurring DNA binding proteins (Trauger, J. W., Baird, E. E. & Dervan, P. B. (1996), Nature 382, 559-561; Swalley, S. E., Baird, E. E. & Dervan, P. B. (1997), J. Am. Chem. Soc. 119, 6953-6961; Turner, J. M., Baird, E. E. & Dervan, P. B. (1997), J. Am. Chem. Soc. 119, 7636-7644; Trauger, J. W., Baird, E. E. & Dervan, P. B. (1998), Angewandte Chemie-International Edition 37, 1421-1423; and Dervan, P. B. & Burli, R. W. (1999), Current Opinion in Chemical Biology 3, 688-693). See, e.g., U.S. Pat. No. 6,559,125. Polyamides can be conformationally constrained and also derivatized, e.g., to enable conjugation to another moiety, e.g., a protein.

Other types of non-proteinaceous agents include other nucleic acids, e.g., nucleic acids that can form a triple helix that interacts with a duplex. For example, U.S. Pat. No. 6,432,638 discloses homopyrimidine polydeoxyribonucleotide probes with at least one detectable marker, chemotherapeutic agent or DNA-cleaving moiety attached to at least one predetermined position. See also U.S. Pat. No. 6,403,302 to Dervan et al. The probes are said to be capable of binding the corresponding homopyrimidine-homopurine tracts within large double-stranded nucleic acids by triple-helix formation at a predetermined site, and can be used for gene therapy.

Characterizing Nucleic Acid Binding

After evaluating interaction of an agent with a collection of nucleic acids (e.g., DNAs or RNAs), one can further characterize interaction of the agent with one or more particular members, e.g., individually, or with a particular k-mer. For example, further characterization can be used to determine qualitative or quantitative parameters that describe the interaction between the agent and a member of the collection, e.g., binding parameters, such as binding kinetics and dissociation constants. Further characterization can also provide more sequence information, e.g., by determining the degree of specificity of the agent for a particular identified k-mer or other sequence. Examples of such methods include: Electrophoretic Mobility Shift Assay (EMSA), footprinting, surface plasmon resonance, and methylation interference.

Electrophoretic Mobility Shift Assay (EMSA). Electrophoretic mobility shift assays can be used to characterize interactions between proteins and nucleic acids. For example, an electrophoretic mobility shift assay (EMSA) can be performed as described previously (see, e.g., Durand, D., et al., Mol. Cell. Biol. 8:1715-1724, and Jones et al. Cell 42:5593 (1985)). In one implementation, binding reactions (15 μl final volume) contain 10 mM Tris-HCl (pH 7.5), 80 mM sodium chloride, 1 mM dithiothreitol, 1 mM EDTA, 5% glycerol, 1.5-2 μg of poly(dI·dC), 5 to 10 μg of the protein agent, and 20,000 cpm (0.1 to 0.5 ng) of ³²P-end-labeled probe (e.g., a double-stranded oligonucleotide probe). After incubation for 45-60 min on ice, the protein-probe complexes are resolved on nondenaturing 5% polyacrylamide gels run in 1× Tris-borate-EDTA (TBE) buffer (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)). Oligonucleotide probes can be labeled using the Klenow fragment. For cold oligonucleotide competition assays, a 1,000-fold molar excess of unlabeled probe (identical to, related to, or unrelated to the probe) can be added to the binding reaction mixture 15 min into the incubation, and the mixture can be further incubated for 30 min at 4° C. prior to gel loading.

DNase I Footprinting Assay. DNase I footprinting can be used to assay for DNA sequences which can be protected from DNase I digestion by an agent. Related methods are available for analyzing RNA. See, e.g., Siegel (1988) Proc Natl Acad Sci USA. 85(6):1801-5. A negative and a positive control can be run on either side of the agent of interest to produce adequate points of reference.

In one implementation, DNase I footprinting is performed using a derivation of the procedures described by Durand et al., Mol. Cell. Biol. 8:1715 (1988) and Jones et al., Cell 42:5593 (1985). Binding reactions are carried out under the conditions described above for EMSA but scaled up to 50 μl. After binding, using 50 μg of nuclear extract, 50 μl of a 10 mM MgCl₂/5 mM CaCl₂ solution and 2 μl of an appropriate DNase I (Worthington, Freehold, N.J.) dilution are added, and the mixture is incubated for 1 minute on ice. DNase I digestion is stopped by adding 90 μl of stop buffer (20 mM EDTA, 1% SDS, 0.2 M NaCl). After addition of 20 μg yeast tRNA as carrier, the samples are extracted two times with an equal volume of phenol/chloroform (1:1) and precipitated after adjusting the solution to 0.3 M sodium acetate and 70% ethanol. DNA samples are then resuspended in 4 μl of an 80% formamide loading dye containing 1× TBE, bromophenol blue and xylene cyanol, heated to 90° C. for 2 minutes, and loaded on 6% polyacrylamide-urea sequencing gels.

Methylation interference can be assayed according to the protocol of Baldwin (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)). First, a preparative EMSA (10-fold scale up of the reaction described above) is performed. Then, the nucleic acid probe (typically DNA) is eluted from the excised bands representing EMSA complexes by electroelution in a Bio-Rad apparatus. Following piperidine cleavage, the DNA ladders are analyzed on standard 10% polyacrylamide-urea sequencing gels (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)).

UV Crosslinking Analysis. Preparative EMSA can be performed as described for methylation interference. Before autoradiography, the gel is exposed to UV light in a STRATALINER™ (Stratagene, La Jolla, Calif.), e.g., as described previously (Kelsumi, H. M., et al., Mol. Cell. Biol. 13:6690-6701 (1993)). Bands are excised and heated to 70° C. in Laemmli sample buffer. Gel slices are loaded into the wells of a sodium dodecyl sulfate 10% polyacrylamide electrophoresis (SDS-10% PAGE) gel run in glycine-SDS buffer (Ausubel, F. M., et al., Green Publishing Associates and Wiley-Interscience, New York (1987)).

Interaction Profiling

In one aspect, the disclosure features a method of providing an interaction site profile for a compound. The method includes providing an array of polymers, e.g., using a collection of polymers described herein, contacting the compound with the array, and identifying polymers with which the compound interacts, thus providing a “raw” interaction site profile. The array includes a plurality of polymers, wherein the plurality includes all or a certain percentage of all k-mers. Typically, each polymer is positionally distinguishable from the other probes.

The term “raw” interaction site profile refers to the profile which indicates interaction between the compound and each polymer of the plurality. In one embodiment, the interaction of the compound with the probe results in a covalent modification of the probe, e.g., a covalent bond of the probe can be broken or formed. In a preferred embodiment, the interaction of the compound with the capture probe is a binding interaction wherein neither the compound nor the probe has a covalent bond broken or formed.

In one embodiment, the raw interaction site profile is a list of objects, each object representing one of the polymers of the array, and having an associated value, preferably a numerical value. The list can contain two, three, four, five, six, seven, eight, nine, ten, 15, 20, 50, 100, 1000 or more objects. In a preferred embodiment, each polymer on the array is represented by an object. In this embodiment, the list includes as many objects as there are polymers, e.g., addresses on the array. In another embodiment, the list includes the polymers which interact with the compound. Thus, the list can contain only those polymers for which an interaction was detected, or only those polymers for which an interaction met a predetermined condition. Such a list has fewer objects as members than the number of unique polymers.
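By way of illustration only, a raw interaction site profile of this kind might be held in memory as sketched below. The class name `ProfileEntry`, its fields, the function `filter_profile`, and the use of a simple signal threshold as the "predetermined condition" are all assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProfileEntry:
    """One object of a raw profile (hypothetical layout)."""
    address: int             # position of the polymer on the array
    sequence: str            # polymer sequence at that address
    value: Optional[float]   # associated value; None stands for "null"


def filter_profile(entries: List[ProfileEntry], threshold: float) -> List[ProfileEntry]:
    """Keep only entries whose value meets the assumed condition value >= threshold."""
    return [e for e in entries if e.value is not None and e.value >= threshold]


# A full list: one object per polymer on the array (values are made up).
raw = [
    ProfileEntry(0, "ACGTAC", 120.0),
    ProfileEntry(1, "TTGACA", 15.0),
    ProfileEntry(2, "GGCATC", None),
]

# A reduced list: only polymers whose interaction met the condition.
hits = filter_profile(raw, threshold=50.0)
```

The full list has as many objects as there are addresses, while the filtered list has fewer members, matching the two embodiments described above.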

The raw profile can be processed, e.g., to determine which k-mer or k-mers the compound interacts with. The results of the processing can be provided in the form of a processed profile.

The results of the processing can be in the form of a processed profile which represents k-mers with which the compound interacts. In one embodiment, the processed profile is a list of objects, each object representing one of the k-mers in the collection on the array, and having an associated value, preferably a numerical value. The list can contain two, three, four, five, six, seven, eight, nine, ten, 15, 20, 50, 100, 1000 or more objects. In a preferred embodiment, each k-mer in the collection is represented by an object. In this embodiment, the list includes as many objects as there are k-mers, e.g., as represented on the array. In another embodiment, the list includes the k-mers which interact with the compound. Thus, the list can contain only those k-mers for which an interaction was detected, or only those k-mers for which an interaction met a predetermined condition.
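One possible way to derive a processed profile from a raw profile is sketched below. The disclosure does not prescribe a particular processing rule; averaging the values of every polymer that contains a given k-mer is an assumption chosen here for simplicity, and the function names `kmers` and `process_profile` are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """Return every length-k window of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def process_profile(raw: Dict[str, float], k: int) -> Dict[str, float]:
    """Map each k-mer to the mean value of the polymers that contain it.

    raw maps a polymer sequence to its associated value.
    The mean is an assumed aggregation rule, not one stated in the text.
    """
    by_kmer = defaultdict(list)
    for seq, value in raw.items():
        for km in set(kmers(seq, k)):   # count each k-mer once per polymer
            by_kmer[km].append(value)
    return {km: mean(vals) for km, vals in by_kmer.items()}


# Two polymers of length 5, each carrying two 4-mers (values are made up).
raw = {"ACGTA": 10.0, "CGTAC": 30.0}
processed = process_profile(raw, k=4)
# processed == {"ACGT": 10.0, "CGTA": 20.0, "GTAC": 30.0}
```

Because each polymer is longer than k, a single raw value contributes to several k-mer objects, which is why the processed list can have more members than the raw list.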

In a preferred embodiment, the raw profile and/or the processed profile is stored in computer memory, such as random access memory or flash memory, or on computer readable media, such as magnetic (e.g., a diskette, removable hard drive, or internal hard drive) or optical media (e.g., a compact disk (CD), DVD, or holographic media). A profile stored in this manner can be on a personal computer, server, e.g., a network server, or mainframe, and can be accessed from another device across a network, e.g., an intranet or internet. In another embodiment, the raw or processed profile is printed onto a medium such as a plastic, a paper or a label, e.g., as a bar code or variation thereof.

The value associated with each object of an interaction site profile can be obtained from a quantitative observation, or a qualitative observation, preferably a quantitative observation. In one embodiment, the associated value is a function of the amount of interaction between the compound and a polymer. For example, the amount of interaction can be the amount of binding, the amount of polymer modification, or affinity. In a preferred embodiment, the associated value is a function of the amount of binding between the compound and the polymer. The value can be a function of the amount of a quantitative observation such as a fluorescent signal, a radioactive signal, or a phosphorescent signal of a contacted polymer. The value can be provided by an instrument, e.g., a CCD camera. In one embodiment, the value is a function of the surface plasmon resonance at the site of a contacted polymer. In a preferred embodiment, the associated values are adjusted for a background signal. In another embodiment, the associated value is a function of moles of bound compound. In yet another embodiment, the associated value is an affinity, relative affinity, apparent affinity, association constant, dissociation constant, logarithm of an affinity, or free energy for binding, of the compound for the particular polymer. In a preferred embodiment, the associated values in the list differ. In other words, the list contains more than one object, and, e.g., the associated values of the objects in the list are not all the same. The values provide a range. The values can be distributed in the range. In some embodiments, the values can approximate a Poisson distribution. The list can contain objects whose associated values are zero, or null. The list can contain objects whose associated values are positive or negative. In one embodiment, the list does not contain any objects whose associated values are zero or null.
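As a hedged illustration of two of the value types mentioned above, the sketch below subtracts a background signal from a measured signal and converts a dissociation constant into a standard free energy of binding using the textbook relation ΔG° = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The function names, the default temperature of 298.15 K, and the example numbers are assumptions for illustration only.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol*K)


def background_adjust(signal: float, background: float) -> float:
    """Subtract a background signal; the result may be negative."""
    return signal - background


def free_energy_from_kd(kd_molar: float, temp_k: float = 298.15) -> float:
    """Standard binding free energy in kJ/mol from a dissociation constant in M."""
    return R_KJ_PER_MOL_K * temp_k * math.log(kd_molar)


adjusted = background_adjust(signal=1500.0, background=200.0)   # 1300.0
dg = free_energy_from_kd(1e-9)   # about -51.4 kJ/mol for Kd = 1 nM
```

Background subtraction is left unclamped here because, as noted above, associated values can be positive or negative.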

In a preferred embodiment, interaction site profiles are provided for a compound at varying concentrations of the compound, e.g., an interaction site profile is provided for a compound at a first concentration, at a second concentration, etc. In another preferred embodiment, interaction site profiles are provided for a compound for interaction with varying concentrations of polymers. For example, an array can have more than one unit, the compositions of the units being identical, but the first unit having the polymers at a first concentration, and the second unit having the polymers at a second concentration, etc. In yet another preferred embodiment, interaction site profiles are provided for a compound for various intervals after contacting the compound to the array. For example, a first profile can be provided after a first interval of time has elapsed after contacting, and a second profile can be provided after a second interval, etc.

An interaction site profile (e.g., a raw and/or processed profile) can be generated for any compound, e.g., a protein, a peptoid, a PNA, or a chemically modified protein. In a much preferred embodiment the compound is a protein. The polypeptide can be a nucleic acid binding protein. In one embodiment, the protein is an RNA contacting protein such as a splicing factor, a ribosomal protein, a viral protein, an RNA modification enzyme, a translation factor, and the like. In a preferred embodiment, the protein is a DNA contacting protein such as a transcription factor, a replication factor, a telomere binding protein, a centromere binding protein, a restriction modification enzyme, a DNA methylase, a DNA repair protein, a single-stranded DNA binding protein, a recombination protein and the like. In one embodiment, the protein is a transcription factor. The transcription factor can bind a double stranded DNA sequence with an affinity of 10 mM, 1 mM, 100 nM or less, preferably 10 nM or less, 1 nM or less, and even more preferably 100 pM or less. The transcription factor can be selected from the group consisting of homeodomains, helix-turn-helix motif proteins, beta-sheets, leucine zippers, steroid receptors, and zinc finger proteins. In one embodiment, the protein is modified or combined with natural and exotic chemical ligands. In yet another embodiment, the compound for which the interaction site profile is generated includes more than one protein, e.g., a complex of proteins.

In a preferred embodiment, the protein is covalently attached to a bacteriophage, e.g., a T7 phage, a lambdoid phage, or a filamentous phage. Preferably, the protein is covalently attached to a filamentous phage such as fd or M13. The protein can be covalently fused to a coat protein by constructing a fusion gene with the gene encoding the polypeptide and the viral coat protein gene, e.g., filamentous phage gene VIII or gene III. In another preferred embodiment, the polypeptide is covalently attached to green fluorescent protein (GFP), or a variant thereof (such as enhanced GFP, CFP, BFP, and the like). The protein can be covalently attached by constructing a fusion gene. In yet another embodiment, the protein is linked with an unrelated sequence, e.g., a fusion protein, purification handle, or epitope tag. Useful examples of such unrelated sequences include maltose binding protein, glutathione-S-transferase, chitin binding protein, thioredoxin, hexa-histidine (or 6-His), the “FLAG tag”, the myc epitope, and the hemagglutinin epitope.

In one embodiment, the protein contains a detectable label. The detectable label can be a radiolabel. Preferably, the detectable label is a fluorescent label, e.g., malachite green, Oregon green, Texas Red, Congo Red, Cy3, SYBRGREEN™ I, or R-phycoerythrin. In another embodiment, the protein is contacted with an antibody. The antibody can contact the protein directly or can contact a covalently attached tag, e.g., a moiety mentioned above.

In a preferred embodiment, the protein is a variant of a natural counterpart. The variant can have at least one amino acid difference from the natural counterpart. Preferably the differing amino acid is located within 50 Ångstroms, 20 Ångstroms, or 10 Ångstroms or less of the bound nucleic acid in a structural model, e.g., a model built from X-ray diffraction data or NMR restraint data, or a homology model.

Quality Control of Protein Production

A collection of polymers described herein can be used to evaluate protein production. For example, the interactions of a purified protein or partially purified protein and molecules of the collection can be evaluated and compared to a reference, e.g., a previous production run or previous sample of the protein. For example, when producing a pharmaceutical compound, e.g., a protein therapeutic, the compound is typically purified from a cruder mixture, e.g., a cell lysate or media that contains a cell secretion. Without needing to know the contents of any particular sample, the interactions between a sample and the molecules of the collection can provide information about its contents and the degree of purity of the desired product, e.g., the protein therapeutic. Impurities may interact with different molecules in the collection than the desired product does and will thereby be detected. Accordingly, when doing numerous production runs to produce a desired product, a sample of the final product or samples from earlier purification steps can be evaluated by contacting to molecules in the collection.
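The comparison to a reference could take many forms. The sketch below shows one assumed approach, not one specified in the disclosure: it computes a Pearson correlation between a reference profile and a sample profile, and lists the array addresses whose values deviate from the reference by more than a chosen tolerance. The function names, the tolerance, and the signal values are hypothetical.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def deviating_addresses(reference: List[float], sample: List[float],
                        tolerance: float) -> List[int]:
    """Indices (array addresses) where the sample differs from the reference."""
    return [i for i, (r, s) in enumerate(zip(reference, sample))
            if abs(s - r) > tolerance]


# Values per array address for a reference run and a new run (made up).
reference = [100.0, 20.0, 5.0, 60.0]
sample = [98.0, 22.0, 40.0, 61.0]

similarity = pearson(reference, sample)
suspect = deviating_addresses(reference, sample, tolerance=10.0)   # [2]
```

In this made-up example, the extra signal at address 2 would be consistent with an impurity interacting with a molecule that the desired product does not bind.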

Receptor Ligands

A collection of peptides that includes all or a certain percentage of all possible k-mers can be used to identify a ligand for a receptor. In one exemplary implementation, the peptides in the collection are located on different addresses of an array. A soluble form of the receptor is produced, e.g., by expressing the extracellular domain, a fragment thereof or a version of the protein lacking the transmembrane domain. The soluble receptor (e.g., labelled receptor) is contacted to the array and locations where the receptor interacts with a peptide are detected. A peptide that includes only the relevant k-mer can be synthesized and further characterized, e.g., by contacting the peptide to a cell that expresses the receptor and evaluating a biological function of the cell.

Delivering Nucleic Acid Binding Proteins

As described herein, a collection of polymers that includes all or a certain percentage of all k-mers can be used to design, select or identify a nucleic acid binding protein, e.g., an engineered nucleic acid binding protein. The protein may have therapeutic or diagnostic uses, e.g., for detecting, preventing, or ameliorating diseases or disorders in a subject or in cells of a subject.

Conventional viral and non-viral based gene transfer methods can be used to introduce nucleic acids encoding engineered NABPs into mammalian cells or target tissues. Such methods can be used to administer nucleic acids encoding NABPs to cells in vitro. Preferably, the nucleic acids encoding NABPs are administered for in vivo or ex vivo gene therapy uses. Non-viral vector delivery systems include DNA plasmids, naked nucleic acid, and nucleic acid complexed with a delivery vehicle such as a liposome. Viral vector delivery systems include DNA and RNA viruses, which have either episomal or integrated genomes after delivery to the cell. For a review of gene therapy procedures, see Anderson, Science 256:808-813 (1992); Nabel & Felgner, TIBTECH 11:211-217 (1993); Mitani & Caskey, TIBTECH 11:162-166 (1993); Dillon, TIBTECH 11:167-175 (1993); Miller, Nature 357:455-460 (1992); Van Brunt, Biotechnology 6(10):1149-1154 (1988); Vigne, Restorative Neurology and Neuroscience 8:35-36 (1995); Kremer & Perricaudet, British Medical Bulletin 51(1):31-44 (1995); Haddada et al., in Current Topics in Microbiology and Immunology, Doerfler and Bohm (eds) (1995); and Yu et al., Gene Therapy 1:13-26 (1994).

Methods of non-viral delivery of nucleic acids encoding engineered NABPs include lipofection, microinjection, biolistics, virosomes, liposomes, immunoliposomes, polycation or lipid:nucleic acid conjugates, naked DNA, artificial virions, and agent-enhanced uptake of DNA. Lipofection is described in, e.g., U.S. Pat. Nos. 5,049,386, 4,946,787, and 4,897,355, and lipofection reagents are sold commercially (e.g., TRANSFECTAM™ and LIPOFECTIN™). Cationic and neutral lipids that are suitable for efficient receptor-recognition lipofection of polynucleotides include those of Felgner, WO 91/17424, WO 91/16024. Delivery can be to cells (ex vivo administration) or target tissues (in vivo administration).

The preparation of lipid:nucleic acid complexes, including targeted liposomes such as immunolipid complexes, is well known to one of skill in the art (see, e.g., Crystal, Science 270:404-410 (1995); Blaese et al., Cancer Gene Ther. 2:291-297 (1995); Behr et al., Bioconjugate Chem. 5:382-389 (1994); Remy et al., Bioconjugate Chem. 5:647-654 (1994); Gao et al., Gene Therapy 2:710-722 (1995); Ahmad et al., Cancer Res. 52:4817-4820 (1992); U.S. Pat. Nos. 4,186,183, 4,217,344, 4,235,871, 4,261,975, 4,485,054, 4,501,728, 4,774,085, 4,837,028, and 4,946,787).

The use of RNA or DNA viral based systems for the delivery of nucleic acids encoding engineered NABPs can exploit the highly evolved processes for targeting a virus to specific cells in the body and trafficking the viral payload to the nucleus. Viral vectors can be administered directly to patients (in vivo), or they can be used to treat cells in vitro, after which the modified cells are administered to patients (ex vivo). Conventional viral based systems for the delivery of NABPs could include retroviral, lentiviral, adenoviral, adeno-associated and herpes simplex virus vectors for gene transfer. Viral vectors are currently the most efficient and versatile method of gene transfer in target cells and tissues. Integration in the host genome is possible with the retrovirus, lentivirus, and adeno-associated virus gene transfer methods, often resulting in long term expression of the inserted transgene. Additionally, high transduction efficiencies have been observed in many different cell types and target tissues.

The tropism of a retrovirus can be altered by incorporating foreign envelope proteins, expanding the potential population of target cells. Lentiviral vectors are retroviral vectors that are able to transduce or infect non-dividing cells and typically produce high viral titers. Selection of a retroviral gene transfer system would therefore depend on the target tissue. Retroviral vectors are comprised of cis-acting long terminal repeats (LTRs) with packaging capacity for up to 6-10 kb of foreign sequence. The minimum cis-acting LTRs are sufficient for replication and packaging of the vectors, which are then used to integrate the therapeutic gene into the target cell to provide permanent transgene expression. Widely used retroviral vectors include those based upon murine leukemia virus (MuLV), gibbon ape leukemia virus (GaLV), simian immunodeficiency virus (SIV), human immunodeficiency virus (HIV), and combinations thereof (see, e.g., Buchscher et al., J. Virol. 66:2731-2739 (1992); Johann et al., J. Virol. 66:1635-1640 (1992); Sommerfelt et al., Virol. 176:58-59 (1990); Wilson et al., J. Virol. 63:2374-2378 (1989); Miller et al., J. Virol. 65:2220-2224 (1991); PCT/US94/05700).

In applications where transient expression of the NABP is preferred, adenoviral based systems are typically used. Adenoviral based vectors are capable of very high transduction efficiency in many cell types and do not require cell division. With such vectors, high titer and levels of expression have been obtained. This vector can be produced in large quantities in a relatively simple system. Adeno-associated virus (“AAV”) vectors are also used to transduce cells with target nucleic acids, e.g., in the in vitro production of nucleic acids and peptides, and for in vivo and ex vivo gene therapy procedures (see, e.g., West et al., Virology 160:38-47 (1987); U.S. Pat. No. 4,797,368; WO 93/24641; Kotin, Human Gene Therapy 5:793-801 (1994); Muzyczka, J. Clin. Invest. 94:1351 (1994)). Construction of recombinant AAV vectors is described in a number of publications, including U.S. Pat. No. 5,173,414; Tratschin et al., Mol. Cell. Biol. 5:3251-3260 (1985); Tratschin et al., Mol. Cell. Biol. 4:2072-2081 (1984); Hermonat & Muzyczka, PNAS 81:6466-6470 (1984); and Samulski et al., J. Virol. 63:3822-3828 (1989).

In particular, at least six viral vector approaches are currently available for gene transfer in clinical trials, with retroviral vectors by far the most frequently used system. All of these viral vectors utilize approaches that involve complementation of defective vectors by genes inserted into helper cell lines to generate the transducing agent.

pLASN and MFG-S are examples of retroviral vectors that have been used in clinical trials (Dunbar et al., Blood 85:3048-305 (1995); Kohn et al., Nat. Med. 1:1017-102 (1995); Malech et al., PNAS 94:22 12133-12138 (1997)). PA317/pLASN was the first therapeutic vector used in a gene therapy trial (Blaese et al., Science 270:475-480 (1995)). Transduction efficiencies of 50% or greater have been observed for MFG-S packaged vectors (Ellem et al., Immunol Immunother. 44(1):10-20 (1997); Dranoff et al., Hum. Gene Ther. 1:111 -2 (1997)).

Recombinant adeno-associated virus vectors (rAAV) are a promising alternative gene delivery system based on the defective and nonpathogenic parvovirus adeno-associated type 2 virus. All vectors are derived from a plasmid that retains only the AAV 145 bp inverted terminal repeats flanking the transgene expression cassette. Efficient gene transfer and stable transgene delivery due to integration into the genomes of the transduced cell are key features of this vector system (Wagner et al., Lancet 351:9117 1702-3 (1998); Kearns et al., Gene Ther. 9:748-55 (1996)).

Replication-deficient recombinant adenoviral vectors (Ad) are predominantly used for colon cancer gene therapy, because they can be produced at high titer and they readily infect a number of different cell types. Most adenovirus vectors are engineered such that a transgene replaces the Ad E1a, E1b, and E3 genes; subsequently the replication-defective vector is propagated in human 293 cells that supply the deleted gene function in trans. Ad vectors can transduce multiple types of tissues in vivo, including nondividing, differentiated cells such as those found in the liver, kidney and muscle system tissues. Conventional Ad vectors have a large carrying capacity. An example of the use of an Ad vector in a clinical trial involved polynucleotide therapy for antitumor immunization with intramuscular injection (Sterman et al., Hum. Gene Ther. 7:1083-9 (1998)). Additional examples of the use of adenovirus vectors for gene transfer in clinical trials include Rosenecker et al., Infection 24:1 5-10 (1996); Sterman et al., Hum. Gene Ther. 9:7 1083-1089 (1998); Welsh et al., Hum. Gene Ther. 2:205-18 (1995); Alvarez et al., Hum. Gene Ther. 5:597-613 (1997); Topf et al., Gene Ther. 5:507-513 (1998); Sterman et al., Hum. Gene Ther. 7:1083-1089 (1998).

Packaging cells are used to form virus particles that are capable of infecting a host cell. Such cells include 293 cells, which package adenovirus, and ψ2 cells or PA317 cells, which package retrovirus. Viral vectors used in gene therapy are usually generated by a producer cell line that packages a nucleic acid vector into a viral particle. The vectors typically contain the minimal viral sequences required for packaging and subsequent integration into a host, other viral sequences being replaced by an expression cassette for the protein to be expressed. The missing viral functions are supplied in trans by the packaging cell line. For example, AAV vectors used in gene therapy typically only possess ITR sequences from the AAV genome which are required for packaging and integration into the host genome. Viral DNA is packaged in a cell line, which contains a helper plasmid encoding the other AAV genes, namely rep and cap, but lacking ITR sequences. The cell line is also infected with adenovirus as a helper. The helper virus promotes replication of the AAV vector and expression of AAV genes from the helper plasmid. The helper plasmid is not packaged in significant amounts due to a lack of ITR sequences. Contamination with adenovirus can be reduced by, e.g., heat treatment to which adenovirus is more sensitive than AAV.

In many gene therapy applications, it is desirable that the gene therapy vector be delivered with a high degree of specificity to a particular tissue type. A viral vector is typically modified to have specificity for a given cell type by expressing a ligand as a fusion protein with a viral coat protein on the virus's outer surface. The ligand is chosen to have affinity for a receptor known to be present on the cell type of interest. For example, Han et al., PNAS 92:9747-9751 (1995), reported that Moloney murine leukemia virus can be modified to express human heregulin fused to gp70, and the recombinant virus infects certain human breast cancer cells expressing human epidermal growth factor receptor. This principle can be extended to other pairs of a virus expressing a ligand fusion protein and a target cell expressing a receptor. For example, filamentous phage can be engineered to display antibody fragments (e.g., Fab or Fv) having specific binding affinity for virtually any chosen cellular receptor. Although the above description applies primarily to viral vectors, the same principles can be applied to nonviral vectors. Such vectors can be engineered to contain specific uptake sequences thought to favor uptake by specific target cells.

Gene therapy vectors can be delivered in vivo by administration to an individual patient, typically by systemic administration (e.g., intravenous, intraperitoneal, intramuscular, subdermal, or intracranial infusion) or topical application, as described below. Alternatively, vectors can be delivered to cells ex vivo, such as cells explanted from an individual patient (e.g., lymphocytes, bone marrow aspirates, tissue biopsy) or universal donor hematopoietic stem cells, followed by reimplantation of the cells into a patient, usually after selection for cells which have incorporated the vector.

Ex vivo cell transfection for diagnostics, research, or for gene therapy (e.g., via re-infusion of the transfected cells into the host organism) is well known to those of skill in the art. In a preferred embodiment, cells are isolated from the subject organism, transfected with a NABP nucleic acid (gene or cDNA), and re-infused back into the subject organism (e.g., patient). Various cell types suitable for ex vivo transfection are well known to those of skill in the art (see, e.g., Freshney et al., Culture of Animal Cells, A Manual of Basic Technique (3rd ed. 1994), and the references cited therein, for a discussion of how to isolate and culture cells from patients).

In one embodiment, stem cells are used in ex vivo procedures for cell transfection and gene therapy. The advantage to using stem cells is that they can be differentiated into other cell types in vitro, or can be introduced into a mammal (such as the donor of the cells) where they will engraft in the bone marrow. Methods for differentiating CD34+ cells in vitro into clinically important immune cell types using cytokines such as GM-CSF, IFN-γ, and TNF-α are known (see Inaba et al., J. Exp. Med. 176:1693-1702 (1992)).

Stem cells are isolated for transduction and differentiation using known methods. For example, stem cells are isolated from bone marrow cells by panning the bone marrow cells with antibodies which bind unwanted cells, such as CD4+ and CD8+ (T cells), CD45+ (panB cells), GR-1 (granulocytes), and Iad (differentiated antigen-presenting cells) (see Inaba et al., J. Exp. Med. 176:1693-1702 (1992)).

Vectors (e.g., retroviruses, adenoviruses, liposomes, etc.) containing therapeutic NABP nucleic acids can also be administered directly to the organism for transduction of cells in vivo. Alternatively, naked DNA can be administered. Administration is by any of the routes normally used for introducing a molecule into ultimate contact with blood or tissue cells. Suitable methods of administering such nucleic acids are available and well known to those of skill in the art, and, although more than one route can be used to administer a particular composition, a particular route can often provide a more immediate and more effective reaction than another route.

Pharmaceutically acceptable carriers are determined in part by the particular composition being administered, as well as by the particular method used to administer the composition. There is a wide variety of suitable formulations for pharmaceutical compositions, as described herein and, e.g., in Remington's Pharmaceutical Sciences, 17th ed. (1989).

Delivery Vehicles for NABPs

An important factor in the administration of polypeptide compounds, such as the NABPs, is ensuring that the polypeptide has the ability to traverse the plasma membrane of a cell, or the membrane of an intra-cellular compartment such as the nucleus. Cellular membranes are composed of lipid-protein bilayers that are freely permeable to small, nonionic lipophilic compounds and are inherently impermeable to polar compounds, macromolecules, and therapeutic or diagnostic agents. However, proteins and other compounds such as liposomes have been described, which have the ability to translocate polypeptides such as NABPs across a cell membrane.

For example, “membrane translocation polypeptides” have amphiphilic or hydrophobic amino acid subsequences that have the ability to act as membrane-translocating carriers. In one embodiment, homeodomain proteins have the ability to translocate across cell membranes. The shortest internalizable peptide of a homeodomain protein, Antennapedia, was found to be the third helix of the protein, from amino acid position 43 to 58 (see, e.g., Prochiantz, Current Opinion in Neurobiology 6:629-634 (1996)). Another subsequence, the h (hydrophobic) domain of signal peptides, was found to have similar cell membrane translocation characteristics (see, e.g., Lin et al., J. Biol. Chem. 270:14255-14258 (1995)).

Examples of peptide sequences which can be linked to a NABP, for facilitating uptake of the NABP into cells, include, but are not limited to: an 11 amino acid peptide of the tat protein of HIV; a 20 residue peptide sequence which corresponds to amino acids 84-103 of the p16 protein (see Fahraeus et al., Current Biology 6:84 (1996)); the third helix of the 60-amino acid long homeodomain of Antennapedia (Derossi et al., J. Biol. Chem. 269:10444 (1994)); the h region of a signal peptide such as the Kaposi fibroblast growth factor (K-FGF) h region (Lin et al., supra); or the VP22 translocation domain from HSV (Elliot & O'Hare, Cell 88:223-233 (1997)). Other suitable chemical moieties that provide enhanced cellular uptake may also be chemically linked to NABPs.

Toxin molecules also have the ability to transport polypeptides across cell membranes. Often, such molecules are composed of at least two parts (called “binary toxins”): a translocation or binding domain or polypeptide and a separate toxin domain or polypeptide. Typically, the translocation domain or polypeptide binds to a cellular receptor, and then the toxin is transported into the cell. Several bacterial toxins, including Clostridium perfringens iota toxin, diphtheria toxin (DT), Pseudomonas exotoxin A (PE), pertussis toxin (PT), Bacillus anthracis toxin, and pertussis adenylate cyclase (CYA), have been used in attempts to deliver peptides to the cell cytosol as internal or amino-terminal fusions (Arora et al., J. Biol. Chem. 268:3334-3341 (1993); Perelle et al., Infect. Immun. 61:5147-5156 (1993); Stenmark et al., J. Cell Biol. 113:1025-1032 (1991); Donnelly et al., PNAS 90:3530-3534 (1993); Carbonetti et al., Abstr. Annu. Meet. Am. Soc. Microbiol. 95:295 (1995); Sebo et al., Infect. Immun. 63:3851-3857 (1995); Klimpel et al., PNAS U.S.A. 89:10277-10281 (1992); and Novak et al., J. Biol. Chem. 267:17186-17193 (1992)).

Such subsequences can be used to translocate NABPs across a cell membrane. NABPs can be conveniently fused to or derivatized with such sequences. Typically, the translocation sequence is provided as part of a fusion protein. Optionally, a linker can be used to link the NABP and the translocation sequence. Any suitable linker can be used, e.g., a peptide linker.

The NABP can also be introduced into an animal cell, preferably a mammalian cell, via liposomes and liposome derivatives such as immunoliposomes. The term “liposome” refers to vesicles comprised of one or more concentrically ordered lipid bilayers, which encapsulate an aqueous phase. The aqueous phase typically contains the compound to be delivered to the cell, e.g., a NABP.

The liposome fuses with the plasma membrane, thereby releasing the drug into the cytosol. Alternatively, the liposome is phagocytosed or taken up by the cell in a transport vesicle. Once in the endosome or phagosome, the liposome either degrades or fuses with the membrane of the transport vesicle and releases its contents.

In current methods of drug delivery via liposomes, the liposome ultimately becomes permeable and releases the encapsulated compound (in this case, a NABP) at the target tissue or cell. For systemic or tissue specific delivery, this can be accomplished, for example, in a passive manner wherein the liposome bilayer degrades over time through the action of various agents in the body. Alternatively, active drug release involves using an agent to induce a permeability change in the liposome vesicle. Liposome membranes can be constructed so that they become destabilized when the environment becomes acidic near the liposome membrane (see, e.g., PNAS 84:7851 (1987); Biochemistry 28:908 (1989)). When liposomes are endocytosed by a target cell, for example, they become destabilized and release their contents. This destabilization is termed fusogenesis. Dioleoylphosphatidylethanolamine (DOPE) is the basis of many “fusogenic” systems.

Such liposomes typically comprise a NABP and a lipid component, e.g., a neutral and/or cationic lipid, optionally including a receptor-recognition molecule such as an antibody that binds to a predetermined cell surface receptor or ligand (e.g., an antigen). A variety of methods are available for preparing liposomes as described in, e.g., Szoka et al., Ann. Rev. Biophys. Bioeng. 9:467 (1980); U.S. Pat. Nos. 4,186,183, 4,217,344, 4,235,871, 4,261,975, 4,485,054, 4,501,728, 4,774,085, 4,837,028, and 4,946,787; PCT Publication No. WO 91/17424; Deamer & Bangham, Biochim. Biophys. Acta 443:629-634 (1976); Fraley et al., PNAS 76:3348-3352 (1979); Hope et al., Biochim. Biophys. Acta 812:55-65 (1985); Mayer et al., Biochim. Biophys. Acta 858:161-168 (1986); Williams et al., PNAS 85:242-246 (1988); Liposomes (Ostro (ed.), 1983, Chapter 1); Hope et al., Chem. Phys. Lip. 40:89 (1986); Gregoriadis, Liposome Technology (1984); and Lasic, Liposomes: from Physics to Applications (1993). Suitable methods include, for example, sonication, extrusion, high pressure/homogenization, microfluidization, detergent dialysis, calcium-induced fusion of small liposome vesicles and ether-fusion methods, all of which are well known in the art.

In certain embodiments, it is desirable to target the liposomes using targeting moieties that are specific to a particular cell type, tissue, and the like. Targeting of liposomes using a variety of targeting moieties (e.g., ligands, receptors, and monoclonal antibodies) has been previously described (see, e.g., U.S. Pat. Nos. 4,957,773 and 4,603,044).

EXAMPLES

Example 1 General Discussion of One Implementation

We have developed a new, highly parallel in vitro microarray technology for high-throughput characterization of the sequence specificities of DNA-protein interactions. We shall refer to this approach as protein binding microarray (PBM) technology. PBM technology allows evaluating interactions between agents and nucleic acids, for example, the measurement of direct or indirect binding of agents, such as epitope-tagged transcription factors, to nucleic acids on DNA microarrays spotted with double-stranded DNAs containing potential DNA binding sites. For instance, a DNA binding protein of interest is expressed with an epitope tag. The tag facilitates protein purification and detection. The epitope-tagged DNA binding protein (usually at least partially purified) is applied to a double-stranded DNA microarray. The microarray is then washed gently to remove any nonspecifically bound protein. The protein-bound microarray is then labeled with a primary antibody specific for the epitope tag expressed as a fusion with the DNA binding protein.

PBM technology enables evaluating the binding site specificities of a nucleic acid binding protein in a single day, starting from the purified protein. Using compact microarrays, we can determine the relative binding affinities for all possible 9-mers using a single nucleic acid array. PBM technology is highly scalable, as many assays may be performed in a single day. Moreover, it is a universal system, since the same nucleic acid array can be used to evaluate proteins from any species for binding to sites in any genome. One person could, for instance, perform triplicate PBM assays with multiple proteins in one day. DNA arrays can be printed, e.g., in production quantities. Identical arrays can be used for triplicate array experiments for each protein of interest.

We have successfully purified FLAG-tagged Rpn4 fusion protein from E. coli using anti-FLAG M2 affinity gel (Sigma). We used this purified Rpn4 fusion protein in EMSAs successfully, indicating that the dual-tagged Rpn4 fusion protein binds DNA sequence-specifically. We have also used this FLAG-tagged Rpn4 in PBM experiments to evaluate strategies such as crosslinking the protein-bound microarrays before labeling with a fluorophore conjugate. For labeling protein-bound PBMs, we evaluated a number of strategies, including: (1) labeling with the M2 anti-FLAG primary antibody (Sigma), followed by R-phycoerythrin-conjugated secondary antibody (Sigma); (2) labeling the protein-bound PBMs with FITC- or Cy3-conjugated anti-FLAG antibody (Sigma); (3) labeling with Alexa488-conjugated M2 anti-FLAG primary antibody (Sigma). Detection can employ a standard microarray scanner (GSI Lumonics SCANARRAY™) equipped with the appropriate lasers and filter sets. We have seen that higher signal intensity is indicative of higher DNA-protein binding affinity. Thus, PBM technology is successful in identifying sequence-specific TF binding. We also found that labeling with a fluorophore-conjugated primary antibody results in high quality data with a broad dynamic range of signal intensities.

We have used GST-tagged yeast TFs in PBM experiments using these microarrays. The washed, protein-bound microarrays are labeled with a primary anti-GST antibody conjugated with the fluorophore Alexa488. The washed arrays are then scanned using a GSI Lumonics SCANARRAY™ microarray scanner. The microarray images are quantified using GENEPIX™ microarray image quantitation software (Axon Instruments, Inc.). The sequences corresponding to the spots with a Bonferroni-corrected p-value<=0.001 are run through a motif finding program to identify the TF's binding site motif. We use an integrated motif finder that combines results from MEME™, MDSCAN™, BIOPROSPECTOR™, and ALIGNACE™. We identified both ungapped (e.g., Rap1 and Mig1) and gapped (e.g., Abf1) DNA binding site motifs. Experimental negative controls, using either GST alone or a GST-fusion to a protein that is not DNA binding, do not result in ‘bound’ spots on the microarray. Computational negative controls, in which randomly selected yeast intergenic regions are searched with motif finding programs, do not result in motifs with significant group specific scores, meaning that any low scoring motifs that do arise from these random sets are not specific to the input set of motifs (the motifs arising from the spots ‘bound’ in our PBMs are highly specific to the set of ‘bound’ spots).

Survey of TF DNA Binding Domain Types:

The table below exemplifies some of the major classes of eukaryotic DNA binding proteins, with the approximate number of proteins as well as the number of domains of each type in the human, fly, and yeast genomes.

Domain type                                Human          Fly           Yeast
C2H2 zinc finger                           564 (4500)     234 (771)     34 (56)
Homeobox domain                            160 (178)      100 (103)     6
Helix-loop-helix DNA binding domain        60 (61)        44            4
Basic leucine zipper (bZIP)                ~55 ^(a)       27 ^(b)       ~20 ^(d)
Nuclear hormone receptor                   47             17            0
Fork head domain                           35 (36)        20 (21)       4
Myb-like DNA-binding domain                32 (43)        18 (24)       15 (20)
Ets family                                 ~22 ^(e)       ~8 ^(c)       ~0 ^(d)
C2CH zinc finger                           17 (22)        6 (8)         3 (5)
Paired domain                              ~9 ^(e)        ~17 ^(c)      ~0 ^(d)
GATA zinc finger                           11 (17)        5 (6)         9
Zn(2)-Cys(6) binuclear cluster (fungal)    0 ^(e)         0 ^(c)        ~11 ^(d)
MADS domain                                ~4 ^(e)        ~2 ^(c)       ~1 ^(d)
HSF family                                 ~5 ^(e)        ~1 ^(c)       ~1 ^(d)

Table Legend. Major Classes of Eukaryotic DNA Binding Proteins. The number of proteins containing the specified Pfam domains as well as the total number of domains (in parentheses) are shown in each column. Unless otherwise indicated, Celera data (Venter et al., Science, 2001) were used; Celera data were used instead of the Public Consortium data because the Celera data provided values for a greater number of different types of DNA binding domains.
^(a) indicates data from Newman and Keating, Science, 2003.
^(b) indicates data from Fassler et al., Genome Res., 2002.
^(c) indicates data from a polypeptide search of FlyBase.
^(d) indicates a search of SGD.
^(e) indicates data on known gene names and descriptions from a gene family search at the UCSC web browser.

In order to evaluate the general applicability of our compact combinatorial DNA microarrays in determining transcription factor (TF) binding sites, we have selected ~50 TFs from S. cerevisiae and D. melanogaster that span a variety of structural classes of DNA binding domains (DBDs) found in eukaryotes. These proteins can be evaluated using a compact DNA array.

Example 2 Design and Simulations of the Microarrays of Compact Combinatorial DNA

The following is one implementation for providing a collection of sequences in which all possible DNA sequence variants can be represented on DNA microarrays in a space- and cost-efficient manner. One advantage of this technology is that only a minimal number of individual DNA sequences and individual DNA spots need to be synthesized. All possible binding sites (which we term k-mers) are represented in an efficient manner by allowing distinct, non-similar sites to overlap with each other on a given DNA fragment. Of course, fewer than all possible sites can also be used. For example, it is possible to exclude homopolymer tracts of length k, k−1, or k−2, or to omit other simple sequences, e.g., tracts of length k, k−1, or k−2 that include only two types of nucleotides.

Typically, the DNAs are significantly longer than k. The addition of each base to the length of the oligonucleotide (‘oligo’) adds another k-mer to that oligo. Compared to the use of a single k-mer per oligo, this approach reduces the number of oligos required to interrogate a set of proteins to discover their DNA binding site sequence interaction preferences.
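The relationship between oligo length and k-mer count can be illustrated with a short sketch. The function name `kmers` and the toy sequence are illustrative assumptions, not part of the disclosed design; the only point shown is that a sequence of length L contains L−k+1 overlapping k-mers.

```python
def kmers(seq, k):
    """Return every overlapping window of length k in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Toy 8-nt sequence (hypothetical, for illustration only).
oligo = "ACGTTGCA"
words = kmers(oligo, 6)          # 8 - 6 + 1 = 3 overlapping 6-mers
longer = kmers(oligo + "G", 6)   # one extra base adds exactly one more 6-mer
```
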

One or more of the following factors can be included in the design of such DNA sequences: 1) placing all possible or a significant percentage of all (e.g., at least 60, 70, 80, 85, 90, 95, 98, or 99%) DNA binding sites within a given binding space (termed a ‘Hamming ball’; here, a ‘Hamming ball’ refers to a most preferential DNA binding site plus all DNA sequences that are within an arbitrary number of mismatches from it) on separate double-stranded oligos and/or separate DNA spots on a DNA microarray; 2) including multiple copies of a given DNA k-mer, with each copy flanked by unique flanking sequence so as to take into account potential junction effects; 3) positioning a given k-mer, if it is located at one end of a double-stranded DNA oligo, at a different location, e.g., centrally within or at the other end of a different double-stranded DNA oligo at another spot on the DNA microarray so as to take into account possible steric effects of the slide surface and/or the requirement for flanking DNA sequence; 4) taking into account in the array design that k-mers within a given Hamming ball might be found on either strand (forward or reverse complement) of double-stranded DNA oligos.

There are a number of possible ways to design such a set of DNA sequences. We have calculated a maximally compact representation of all possible binding sites. Using a standard tool from cryptography called a linear shift register, we were able to construct in silico an appropriate long sequence that contains all words of a given length exactly once. See, e.g., S. Golomb, Shift Register Sequences, Aegean Park Press, 1967. Other methods for constructing a string that contains all words (or a substantial fraction of all, e.g., at least 70, 80, 85, 90, 95, 98, 99% of all words) of a given length can also be used.
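As an illustration of one such "other method", the sketch below builds a sequence containing every k-mer exactly once by concatenating Lyndon words (a standard de Bruijn sequence construction). This is not the linear shift register construction described above; it is a swapped-in technique chosen because it is short and self-contained. The function names `de_bruijn` and `linearize` are assumptions made for this example.

```python
def de_bruijn(alphabet, k):
    """Cyclic de Bruijn sequence of order k over alphabet (Lyndon-word method)."""
    size = len(alphabet)
    a = [0] * (size * k)
    out = []

    def extend(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            extend(t + 1, p)
            for j in range(a[t - p] + 1, size):
                a[t] = j
                extend(t + 1, t)

    extend(1, 1)
    return "".join(alphabet[i] for i in out)

def linearize(cyclic, k):
    """Append the first k-1 letters so that wrap-around words appear in a linear string."""
    return cyclic + cyclic[:k - 1]

# Small example with k = 4: 4**4 = 256 words in a linear string of 256 + 3 letters.
k = 4
linear = linearize(de_bruijn("ACGT", k), k)
words = [linear[i:i + k] for i in range(len(linear) - k + 1)]
```

The same call with k=9 would give a string of 4^9 + 8 letters; the small k is used here only to keep the example quick to check.
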

This long sequence is segmented into short sequences of the same length with overlapping ends to ensure that none of the k-mers contained within it is destroyed in the segmenting process. These shorter sequences correspond to the individual spots on our combinatorial microarrays. In this implementation, each of our two shift registers represents each 9-mer once. Because a given k-mer and its reverse complement should be bound equally well by a given agent, our two shift registers will have every 9-mer represented four times. Similarly, each 8-mer can be represented 16 times, each 7-mer 64 times, and so on.
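The segmentation step can be sketched as follows, assuming that consecutive segments overlap by k−1 letters, which is the smallest overlap that keeps every k-mer intact. The function name `segment`, the toy input string, and the choice to let the final segment be shorter are assumptions for illustration.

```python
def segment(sequence, k, width):
    """Cut sequence into pieces of at most `width` letters overlapping by k-1 letters."""
    step = width - k + 1                     # number of new k-mers per segment
    return [sequence[start:start + width]
            for start in range(0, len(sequence) - k + 1, step)]

# Toy input (hypothetical): 17 letters, k = 3, segments of 6 letters.
sequence = "AACAGATCCGCTGGTTA"
k, width = 3, 6
spots = segment(sequence, k, width)

# Every k-mer of the long sequence lands in exactly one segment.
original = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
recovered = [s[i:i + k] for s in spots for i in range(len(s) - k + 1)]
```
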

In evaluating interaction with an array, first, it is helpful to provide a means of discerning which of the words contained within a ‘bound’ spot are actually binding sites. This ensures that one can assay the relative binding affinity of the agent for each of the different candidate binding sites. We refer to the array design criterion of ensuring that as few distinct k-mers as possible within a given molecule are recognized by a given agent as “discernability”.

Second, it is helpful to provide a way of recovering which k-mer in a molecule is recognized by the agent. We refer to identifying the relevant k-mer from other k-mers present in the same molecule as “recoverability.” In some cases, there can be more than one recognized k-mer per spot, and the observed signal would presumably be due to some combination of the affinities of the TF for each of these sites. Further characterization (e.g., EMSA, footprinting, mutational analysis) can be used to identify such situations.

We have formalized the notions of: 1) ‘discernability’, e.g., ensuring that for every ‘Hamming ball’, for each of its words there exists at least one spot that does not contain another word from that Hamming ball, and 2) ‘recoverability’ of Hamming balls. These two concepts (discernability and recoverability) jointly facilitate determining which particular k-mer is actually bound in a molecule. We assume that for any agent whose binding site is of length k, most of the high affinity binding sites can be within a given number r of mismatches of some central word w or its reverse complement (simplistically one could think of this central word as the ‘consensus’). We refer to the collection of all words within r mismatches of w as a “ball of radius r and centered at w,” or simply B(w,r). In the following calculations, we found which ball contains the binding sites from the collection of ‘bound’ spots, thus solving the recoverability problem. Furthermore, we show that for any word w′ in a ball B(w,r) there will almost always exist at least one spot where w′ occurs in the absence of any other word from B(w,r), thus solving the discernability problem. Here, we considered binding sites with widths of lengths 6 through 9, and have assumed that their binding sites are contained within balls of radius 1 or 2 (in the case of k=9 we also looked at a radius of 3); e.g., for all words of length 6, 7, or 8, we considered all possible single and double mismatches, and for words of length 9, we also considered all possible triple mismatches.
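The ball B(w,r) can be enumerated directly. The sketch below is a minimal illustration; the function name `hamming_ball` is an assumption, and for simplicity it ignores the reverse complement of w, which the text above also admits as a center.

```python
from itertools import combinations, product

def hamming_ball(word, r, alphabet="ACGT"):
    """All words that differ from `word` at no more than r positions."""
    ball = {word}
    for m in range(1, r + 1):
        for positions in combinations(range(len(word)), m):
            # At each chosen position, substitute one of the three other letters.
            choices = [[c for c in alphabet if c != word[p]] for p in positions]
            for subs in product(*choices):
                letters = list(word)
                for p, c in zip(positions, subs):
                    letters[p] = c
                ball.add("".join(letters))
    return ball

ball_r1 = hamming_ball("ACGTAC", 1)   # 1 + 6*3 = 19 words
ball_r2 = hamming_ball("ACGTAC", 2)   # 19 + C(6,2)*9 = 154 words
```

The ball size grows as the sum over m ≤ r of C(k,m)·3^m, which is why larger radii make it harder to keep words of one ball on separate spots.
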

In one exemplary design, oligonucleotides that are 60 nucleotides in length are used. A 60 nucleotide oligonucleotide can include a 16 nt universal primer sequence incorporated at the 5′ end of each oligo. In addition, we have chosen to incorporate 5 N's (where N=A/C/G/T) between the universal primer sequence and the combinatorial portion of the sequence so as to minimize any possible artifactual binding due to junction effects between the universal primer sequence and the combinatorial portion of the sequence. We have also chosen to incorporate 5 N's flanking the other end of the combinatorial portion of the sequence so that the combinatorial portion of the sequence is not at the very terminus of the DNA, since many DNA binding proteins need to make nonspecific contacts with DNA beyond the nucleotides with which they make sequence-specific contacts. Therefore we are left with just 34 bp for the combinatorial portion of each oligo. For binding sites 9 bp in length, the internal 34 bp will contain 34−9+1=26 potential binding sites. Given these constraints, a DNA microarray with 20,166 spots would contain two shift registers, and thus would contain each 9-mer four times (when reverse complements are considered). Spotting this number of spots onto a standard 1×3 inch glass slide format can readily be achieved using standard robotic arraying instruments.
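The layout arithmetic of this exemplary design can be checked with a short sketch. The function name `build_oligo` is hypothetical, the primer is the first example primer sequence listed at the end of this description, and writing the degenerate positions as the literal letter "N" is a simplifying assumption.

```python
PRIMER = "TCAAGTCAATCGGTCC"   # 16 nt example universal primer from this description

def build_oligo(combinatorial, primer=PRIMER, n_flank=5):
    """Primer + 5 N's + combinatorial portion + 5 N's, written 5' to 3'."""
    return primer + "N" * n_flank + combinatorial + "N" * n_flank

# Placeholder 34 nt combinatorial portion (hypothetical content).
combinatorial = "ACGT" * 8 + "AC"
oligo = build_oligo(combinatorial)

k = 9
sites_per_spot = len(combinatorial) - k + 1   # 34 - 9 + 1
```
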

Provided in the Table below is an exemplary listing of how many times a given k-mer (a “word”) can be represented in a shift register designed to contain all possible k-mers:

                        Shift register size (e.g., designed for what k-mer)
k-mer (“word”) length      6      7      8      9     10     11     12
 6                         2      8     32    128    512   2048   8192
 7                                2      8     32    128    512   2048
 8                                       2      8     32    128    512
 9                                              2      8     32    128
10                                                     2      8     32
11                                                            2      8
12                                                                   2
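The entries of this table follow a simple pattern, sketched below. The function name `occurrences` is an assumption, and the formula is inferred from the tabulated values: a word of length k is a prefix of 4^(K−k) of the K-mers in a register of size K, and the count is doubled because a word and its reverse complement are treated as equivalent. End effects and self-complementary words are ignored in this sketch.

```python
def occurrences(register_size, word_length):
    """Times a word appears in one register, counting both strands."""
    if word_length > register_size:
        raise ValueError("word is longer than the register's k-mer size")
    return 2 * 4 ** (register_size - word_length)

row_for_6mers = [occurrences(size, 6) for size in range(6, 13)]
```
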

Provided in the Table below is an exemplary listing of how many spots on a microarray would be required to represent a given shift register size (e.g., designed for what k-mer), using 2 different permissible combinatorial sequence lengths (44 nt, which corresponds to a 60-mer minus a 16 nt universal primer sequence; 34 nt, which corresponds to a 60-mer minus a 16 nt universal primer sequence, minus 5 degenerate (“N”) positions on either side of 34 combinatorial positions) as examples:

                 # combinatorial positions = 44    # combinatorial positions = 34
register size    # spots required                  # spots required
 6                  106                               142
 7                  432                               586
 8                 1772                              2428
 9                 7282                             10083
10                29960                             41944
11               123362                            174763
12               508401                            729445

Note that up to 30,000 spots can be easily printed onto standard size 1 in×3 in glass slides using typical microarray printing robotic instruments. This accommodates shift registers designed for up to 10-mers in the case of spots containing 44 combinatorial positions, and shift registers designed for up to 9-mers in the case of spots containing 34 combinatorial positions.
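The tabulated spot counts are consistent with dividing the 4^k words of a register among spots that each hold (positions − k + 1) words, rounded up. The sketch below reproduces the table under that assumption; the function name `spots_required` is hypothetical.

```python
def spots_required(k, positions):
    """Spots needed to hold all 4**k k-mers with `positions` combinatorial bases per spot."""
    per_spot = positions - k + 1          # overlapping k-mers that fit in one spot
    return -(-4 ** k // per_spot)         # ceiling division using integers only

table = {k: (spots_required(k, 44), spots_required(k, 34)) for k in range(6, 13)}
two_registers_34 = 2 * spots_required(9, 34)   # two 9-mer registers, 34 positions per spot
```

For k=9 and 34 positions this gives 10083 spots per register, so two registers need 20,166 spots, matching the array size given earlier in this example.
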

We have run computational simulations on our compact combinatorial array design; the results of these simulations indicate that almost all words in almost all Hamming balls are discernable, and that it is unlikely for a Hamming ball to be unrecoverable. Any possible DNA binding site can be extracted from such ‘combinatorial arrays.’ The arrays can be used to distinguish between two very similar DNA binding sites. For discernability, we worked under the ‘worst case’ assumption that all words within the ball were binding sites; this can be considered ‘worst case’ since the larger the number of binding sites, the harder it is to distribute them on the array in a fashion so that two binding sites do not occur on the same spot. For each k, we considered all balls of radius r. For each ball, we determined whether each word in the ball occurred at least once on a spot in the absence of any other word in the ball. For that ball, we then computed the proportion of words that were discernable under the assumption that all words in the ball could be binding sites. Reported below are the average values of the proportion of words in the ball centered at w that were discernable, across all balls for 6≤k≤9:

-   k=6, r=1: average discernability=99.998%
-   k=6, r=2: average discernability=99.790%
-   k=7, r=1: average discernability=99.995%
-   k=7, r=2: average discernability=99.803%
-   k=8, r=1: average discernability=99.997%
-   k=8, r=2: average discernability=99.897%
-   k=9, r=1: average discernability=99.954%
-   k=9, r=2: average discernability=99.474%
-   k=9, r=3: average discernability=81.578%

Thus, we see that almost all binding sites are discernable for all k and all r that we considered (for candidate binding sites of length 6, 7, or 8, we considered all possible single and double mismatches, and for candidate binding sites of length 9, we also considered all possible triple mismatches). Similar simulations can be used to evaluate properties of Hamming balls of size B(6,3), B(7,3), B(8,3), B(8,4), and B(9,4).
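The per-ball discernability measure described above can be sketched as follows. This is a minimal illustration on toy data, not the simulation that produced the reported percentages; the function name `discernable_fraction` and the example spots are assumptions, and reverse complements are ignored.

```python
def discernable_fraction(ball, spots, k):
    """Fraction of words in `ball` that occur on some spot free of other ball words."""
    spot_words = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in spots]
    discernable = 0
    for word in ball:
        others = ball - {word}
        # A word is discernable if at least one spot holds it without any other ball word.
        if any(word in words and not (words & others) for words in spot_words):
            discernable += 1
    return discernable / len(ball)

# Toy data (hypothetical): "ACG" occurs alone on the second spot,
# while "CGT" occurs only on the first spot, together with "ACG".
spots = ["ACGTA", "ACGAC"]
ball = {"ACG", "CGT"}
fraction = discernable_fraction(ball, spots, k=3)
```
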

To test the expected recoverability rate, we developed a simulator that first picks a ball B of a given radius r surrounding a given randomly chosen word. The program then picks a random collection of words within the ball to be considered binding sites and marks the spots containing these words as ‘bound.’ The program then asks if there is another possible subset of words from some ball that could have caused the exact same collection of spots to be ‘bound’. Because there are 4^(k) balls for each k, and because for each candidate ball all of its subsets must be examined, one is limited for practical reasons in the number of simulations that can be performed. Therefore, for each 6≤k≤9 and each r equal to 1 or 2 (or 3 for k=9), we generated 500 collections of binding sites, and then checked each collection to see if the ball containing these sites was recoverable. Out of all 9 of these sets of 500 collections of binding sites, the ball containing the binding sites was always recoverable. Moreover, even if the binding sites are not recoverable, it appears to be very likely that the binding sites would always be contained within one of a very limited number of candidate balls, and the correct one could then be identified, e.g., with a small set of EMSAs.

Furthermore, we wanted to rigorously test the hypothesis that the collection of binding sites for a given agent can be contained within a Hamming ball of a given width and radius. In other words, we wanted to ensure that most transcription factor DNA binding sites were of a length and sequence degeneracy that would allow us to extract PBM data on each of their individual candidate binding sites using our combinatorial arrays.

We extracted transcription factor binding site data from the TRANSFAC database for 110 non-redundant TFs (fungi, 1; insect, 8; plant, 12; vertebrate, 91) for which in vitro binding site selection (SELEX) data was available. For each of these TFs we then generated the corresponding binding site motifs. We generated a matrix P(i,j) to represent the frequency of nucleotide i at position j of each TF's binding site motif. Next, we assumed a background mononucleotide frequency of A or T=0.28 and G or C=0.22 (this is very close to the nucleotide frequencies within 50 kb upstream of transcriptional start and downstream of transcriptional stop for the human, mouse and fly genomes). We shall use Q to denote this background probability distribution on {A,C,G,T}. At each position j, we sought to compare the distribution of nucleotides P(i,j) to the background distribution Q(i). If P(i,j) was ‘far’ from Q(i) at position j, then we considered position j to be part of the binding site motif, while if P(i,j) was very close to Q(i) for a given j, then we did not consider it to be part of the binding site motif. However, such a calculation required a definition of distance, as well as a concept of the average distance that any randomly selected distribution P(i,j) is from Q(i). For the metric, we chose to use relative entropy, which we denote by R. Here, if P and P′ are any two probability distributions, then their relative entropy is given by:

$R(P,P^{\prime}) = \sum\limits_{i \in \{A,C,G,T\}} P(i)\,\log\left( \frac{P(i)}{P^{\prime}(i)} \right)$
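A minimal sketch of the relative entropy calculation is given below. The function name `relative_entropy` and the example column are assumptions for illustration, and the natural logarithm is used because the base of the logarithm is not stated in the text. Terms with P(i)=0 are taken to contribute zero, following the usual convention.

```python
import math

# Background distribution Q from the text: A or T = 0.28, G or C = 0.22.
Q = {"A": 0.28, "C": 0.22, "G": 0.22, "T": 0.28}

def relative_entropy(p, q):
    """R(P, P') = sum over i of P(i) * log(P(i) / P'(i)), using natural log (assumed)."""
    return sum(p[i] * math.log(p[i] / q[i]) for i in p if p[i] > 0)

# A hypothetical motif column that strongly prefers A.
column = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
distance_specific = relative_entropy(column, Q)   # large: far from background
distance_background = relative_entropy(Q, Q)      # zero: identical to background
```
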

We computed the average ‘distance’ of a generic probability distribution P from Q by integrating R(P,Q) over all possible probability distributions P (this was done numerically via a Monte Carlo simulation). We let μ denote the result of this integral (approximately equal to 0.18). For a given TF and a given input width of binding site, we took the frequency matrix P(i,j) and the background distribution and computed the sum:

$S = \sum\limits_{j} \left( R\left( P(i,j), Q \right) - \mu \right)$

Because the term R(P(i,j),Q)−μ has mean zero and is positive if P(i,j) is far from Q and negative if it is close, we can take the width of the motif to be the value that maximizes the quantity S. We performed this calculation for all 109 transcription factors for which binding site data from SELEX experiments was available in TRANSFAC. For those transcription factors whose binding site motif widths were 13 bp or shorter, we computed the radius of the Hamming ball that contained them. The results of these calculations, combined with the simulations described above for various Hamming balls, indicate that our combinatorial arrays will allow us to discern the relative binding preferences for most transcription factors whose DNA binding sites are 9 bp or fewer.
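The two steps above, estimating μ by Monte Carlo and choosing the width that maximizes S, can be sketched as follows. Several choices here are assumptions not stated in the text: distributions P are sampled uniformly from the probability simplex, the natural logarithm is used, and the motif is taken to be the contiguous window of a fixed frequency matrix with the largest S. With these choices the estimate of μ need not reproduce the reported value of approximately 0.18. All function names are hypothetical.

```python
import math
import random

Q = {"A": 0.28, "C": 0.22, "G": 0.22, "T": 0.28}

def relative_entropy(p, q):
    return sum(p[i] * math.log(p[i] / q[i]) for i in p if p[i] > 0)

def estimate_mu(samples=5000, seed=0):
    """Monte Carlo mean of R(P, Q) for P drawn uniformly from the simplex (assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        g = [rng.gammavariate(1.0, 1.0) for _ in "ACGT"]   # normalized gammas = flat Dirichlet
        p = {b: x / sum(g) for b, x in zip("ACGT", g)}
        total += relative_entropy(p, Q)
    return total / samples

def best_window(matrix, mu):
    """Return (start, width, S) of the contiguous window that maximizes S."""
    scores = [relative_entropy(col, Q) - mu for col in matrix]
    best = (0, 0, 0.0)
    for start in range(len(scores)):
        s = 0.0
        for end in range(start, len(scores)):
            s += scores[end]
            if s > best[2]:
                best = (start, end - start + 1, s)
    return best

# Hypothetical matrix: three specific columns flanked by background-like columns.
specific = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
matrix = [dict(Q), specific, specific, specific, dict(Q), dict(Q)]
start, width, S = best_window(matrix, mu=0.18)   # 0.18 is the value reported in the text
mu_estimate = estimate_mu()
```

Background-like columns score −μ each, so extending the window into the flanks lowers S, which is what makes the maximum of S a usable definition of motif width.
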

We have found that the DNA binding site motifs that we have identified using PBM technology can correspond well to the DNA binding site motifs we determined from an analysis of the genome-wide location analyses (ChIP-chip) performed by Lee et al. on these same transcription factors. Lee et al.: Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 2002, 298:799-804.

Synthesis of the Compact Combinatorial dsDNA Microarrays:

Double-stranded DNAs (dsDNAs) 60 bp in length can be synthesized as described previously in Bulyk et al. (2001) Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl. Acad. Sci. U.S.A. 98:7158-7163. Briefly, oligos 60 bp in length can be synthesized (Illumina) such that they contain our compact combinatorial DNA design plus a universal primer sequence 16 bp in length. Two different universal primer sequences can be used, so as to allow us to identify and control for any specific binding either to a site within one of the universal primers or a site resulting from the junction of the universal primer sequence with the variable (compact combinatorial design) region. These two different universal primers were designed to represent as different as possible a set of k-mers, so as to ensure that in case one of the universal primers happened to contain a binding site for a TF, the other one probably would not also contain a binding site for that TF. Moreover, we have designed two independent, complete sets of compact combinatorial DNAs, and each one of these can be used with a different one of the two universal primers, so that in the end there are a total of two unique sets of universal primer+compact combinatorial DNAs. Each full-length oligo can be combined with its universal primer in a 2:1 molar ratio in a primer extension reaction in a 96-well plate format employing the SEQUENASE™ V2.0 DNA polymerase (United States Biochemical). The completed primer extension reactions can be purified away from the unincorporated universal primer and dNTPs and exchanged into printing buffer by using QIAQUICK™ filtration plates (Qiagen). Purified dsDNAs can be replicated into 5 daughter plates which can be used for printing the microarrays.

Before synthesizing the full set of compact combinatorial DNAs, we will first compare two different methods for attachment of dsDNAs to glass slides, with regard to the quality of the resulting PBM data. In the first method, the dsDNAs are end-attached to amine-reactive slides through the use of amino-tagged universal primers, as described previously. Alternatively, unmodified dsDNAs can be spotted onto GAPS II™ slides (Corning), and then covalently attached to the slides via UV-crosslinking. Specifically, the microarrays can be rehydrated and then UV-crosslinked in a STRATALINKER™ (Stratagene). End-attachment of the dsDNAs should allow the DNAs to not be kinked and to be maximally accessible for interaction with DNA binding proteins. However, the use of unmodified dsDNAs would allow us to use GAPS II™ slides instead of amine-reactive slides, which are extremely moisture-sensitive. It is possible that a UV-crosslinking method could be developed that would optimize the two opposing issues of: (1) ensuring that the DNA structure is as unperturbed as possible, e.g., ideally most DNA molecules will have just one crosslink to the slide surface; (2) ensuring that most spotted DNA molecules can be attached to the slides.

TF Cloning, Expression, and Purification:

We have successfully used TFs epitope-tagged with the FLAG tag and separately with GST in PBM experiments. The epitope tag serves a dual purpose; it allows for both purification and labeling with a respective fluorophore-conjugated anti-tag antibody (such as Alexa488-conjugated anti-GST antibody). E. coli is a robust, convenient, and inexpensive expression system for the production and purification of epitope-tagged human proteins, including TFs. Braun et al. Proc. Natl. Acad. Sci. U.S.A. 2002, 99:2654-2659. DNA binding proteins can be cloned into GATEWAY™ (Invitrogen) entry clones and then transferred into GATEWAY™ compatible expression vectors. After complete sequence verification of the entry clones, the genes can be moved into the pDEST15 expression vector, which provides an N-terminal GST fusion tag. Subcloning of these genes can be performed by using standard GATEWAY™ recombinational subcloning methodology. High efficiency competent cells of DH5alpha and BL21(DE3)pLysS strains of E. coli can be used for transformations. Expression is preferably in BL21(DE3)pLysS; alternatively, fusion proteins can be expressed in yeast or insect cells, or by in vitro transcription and in vitro translation using either an E. coli lysate system or a rabbit reticulocyte lysate system (see below). Proteins can be expressed, for example, as full length proteins or as fragments, e.g., a fragment that includes the DNA binding domains. Since many known and predicted yeast and fly TFs are fairly large (>80 kDa), we may have to express and purify just the DNA binding domains of a number of these TFs.

Proteins expressed in E. coli can be tested for solubility and purity by denaturing gel electrophoresis. It is expected that not all of the expressed proteins will be obtained in the soluble fraction in our initial expression attempts. Induction at lower temperature (18° C.) can be attempted for proteins which are not obtained in the soluble fraction. Slower growth of bacterial cells leads to lower expression levels of the over-expressed protein, which may alleviate aggregation of the expressed protein into inclusion bodies and help us to obtain a greater percentage of the protein in the soluble fraction. Conditions such as the use of lower concentrations of IPTG will also be tested. Should protein solubility be an issue, it is possible to modify purification protocols by using detergents or by using different bacterial strains. Use of mutant bacterial strains for expressing fastidious proteins has been reported previously. See, e.g., Miroux et al. J. Mol. Biol. 1996, 260:289-298. Purification of the GST-fusion proteins can be performed according to standard protocols. Briefly, GST-fusion protein can be bound to glutathione agarose beads, washed, and then eluted with reduced glutathione.

It is also possible to produce the proteins in vitro, e.g., using a cell-free E. coli extract. In vitro translation can be performed, e.g., in microtitre wells, or even directly on an array, e.g., for high throughput screening purposes. After translation, the entire array can be washed to remove the translation extract and unbound protein.

Protein Binding Microarray (PBM) Experiments and Data Analysis:

Protein binding reactions and subsequent labeling can be performed essentially as described in Preliminary Studies. Before synthesizing the 2 unique sets of compact combinatorial DNAs, we will first synthesize just 1 set (with just 1 universal primer sequence), so as to determine how well we can extract TFBS motifs from the combinatorial array PBM data. PBM experiments and SYBRGREEN™ I (SGI) staining of the combinatorial dsDNA microarrays can be performed in triplicate. All microarrays used for a given TF can be from the same print run, so as to minimize variation. We typically scan the labeled, protein-bound microarrays and the SGI-stained microarrays not just once but rather multiple times, at a range of different laser power intensities (or PMT gain settings). We typically scan at ~3-6 different laser power intensities (or PMT gain settings) per microarray; this allows us to capture signal intensities for even very low signal intensity spots, while ensuring that we capture sub-saturation signal intensities for each of the spots on the microarray. Microarray TIF images can be quantified with GENEPIX™ microarray analysis software. Background-subtracted median intensities can be calculated using the median local background surrounding each spot, in order to account for uneven distribution of background fluorescence over the slide surface. We can use MASLINER™ (MicroArray Spot LINEar Regression) software to calculate the relative signal intensities over all the laser power (or PMT gain) settings in a semi-automated fashion. Specifically, MASLINER™ combines the linear ranges of multiple scans from different scanner sensitivity settings onto an extended linear scale. Dudley et al. Proc. Natl. Acad. Sci. U.S.A. 2002, 99:7554-7559. We can calculate the fractional signal intensity of each spot, relative to the total signal intensity on the microarray. To control for variation in DNA concentration, we normalize each spot's PBM fractional signal intensity by its SGI fractional signal intensity.
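The final normalization step can be sketched as follows. The function names and the toy intensities are assumptions for illustration; the inputs stand in for background-subtracted intensities already combined across scans, and the sketch omits the scanning and regression steps.

```python
def fractional(intensities):
    """Each spot's share of the total signal on one microarray."""
    total = sum(intensities.values())
    return {spot: value / total for spot, value in intensities.items()}

def normalized_binding(pbm, sgi):
    """PBM fractional intensity divided by SGI fractional intensity, per spot."""
    f_pbm, f_sgi = fractional(pbm), fractional(sgi)
    return {spot: f_pbm[spot] / f_sgi[spot]
            for spot in pbm if f_sgi.get(spot, 0) > 0}

# Toy intensities (hypothetical): equal DNA on both spots, three times more protein on s1.
pbm = {"s1": 300.0, "s2": 100.0}
sgi = {"s1": 50.0, "s2": 50.0}
ratios = normalized_binding(pbm, sgi)
```
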

The data from these PBM experiments using the compact combinatorial DNA microarrays can be analyzed essentially the same way as we currently analyze our yeast intergenic microarray PBM data. Briefly, after running MASLINER™ on multiple scans, the microarray data can be filtered so that only data from high quality spots are retained. To achieve this, a number of criteria can be applied to ensure that only high quality spots can be analyzed. First, for each of the triplicate microarrays, data corresponding to any spots that are flagged “Bad” (e.g., dust flecks, etc.) can be removed. Data from each of three triplicate microarrays can be normalized according to total signal intensity, so that the average spot intensity is the same for all three slides. Then, within each individual slide, the data can be separated into sectors, e.g., according to their local region on the slide. In one example, we have organized the spots into the 32 subgrids of the printed microarray. The data will then be normalized again so that the average spot intensity is the same over all the sectors; this serves to normalize for any region-specific non-homogeneity in the background and also binding and labeling reactions. Any spots with SD/median greater than 2.0, e.g., spots with highly variable pixel signal intensities, can be filtered out. The background-subtracted, normalized signal intensities for all spots with reliable data in at least two of the three replicate microarray experiments can be averaged, and the SD/mean can be calculated. The SGI microarray data are treated exactly the same way, except that any spots with fewer than 50% pixels with signal intensities greater than (median background intensity+2 SDs) are also filtered out, as these spots presumably do not have enough DNA present to be able to measure accurate signal intensities.

For each spot, we calculate ln(mean PBM/mean SGI) and create a scatter plot of the log ratio versus the spots' SGI intensities. Although we expect that the log ratio should be independent of DNA concentration, we have found that higher concentrations, as determined by higher SGI signal intensities, seem to exhibit proportionately less bound protein. In order to restore the independence of log ratio and SGI intensity, the scatter plot is fit with a locally weighted least squares regression using the LOWESS function of the R statistics package (smoothing parameter=0.5). We subtract the value of the regression at each spot from its log ratio, yielding a modified log ratio that is independent of DNA concentration. We then plot the distribution of all log ratios as a histogram (bin size=0.04), which resembles a Gaussian with a heavy tail. We determine the mode of the distribution by searching for the window of five bins with the highest number of spots and taking the middle bin. We then reflect all values less than the mode and fit these values to a Gaussian function using the MATHEMATICA™ software package (Wolfram Research, Inc.). This gives the mean and standard deviation (SD) of the distribution. We adjust all log ratios so that the peak is centered on zero. We then calculate a p-value for each individual spot based on the number of SDs its log ratio is above the mean of the Gaussian distribution. In order to correct for multiple hypothesis testing, all individual p-values are then adjusted to a modified significance level using a modified Bonferroni correction.
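The mode-finding, reflection, and p-value steps can be sketched as below, starting from log ratios that have already been LOWESS-corrected. Several simplifications are assumptions of this sketch rather than the method described: the Gaussian SD is estimated directly from the reflected values (a moment estimate in place of the MATHEMATICA™ fit), the p-value is a one-sided upper-tail Gaussian probability, and a plain Bonferroni adjustment (multiplying by the number of spots) stands in for the modified Bonferroni correction, whose details are not given here. All function names are hypothetical.

```python
import math
from collections import Counter


def find_mode(values, bin_size=0.04, window=5):
    """Return the center of the middle bin of the fullest window of `window` adjacent bins."""
    counts = Counter(math.floor(v / bin_size) for v in values)
    lo, hi = min(counts), max(counts)
    best_start, best_total = lo, -1
    for start in range(lo, max(lo, hi - window + 1) + 1):
        total = sum(counts.get(start + i, 0) for i in range(window))
        if total > best_total:  # ties keep the left-most window
            best_start, best_total = start, total
    middle_bin = best_start + window // 2
    return (middle_bin + 0.5) * bin_size


def null_sd(values, mode):
    """SD of the distribution obtained by reflecting the values below the mode about the mode.

    After reflection the distribution is symmetric about the mode, so its SD is
    the root mean squared distance of the below-mode values from the mode.
    """
    below = [v for v in values if v < mode]
    if not below:
        raise ValueError("no values below the mode")
    return math.sqrt(sum((v - mode) ** 2 for v in below) / len(below))


def binding_p_values(log_ratios, bin_size=0.04, window=5):
    """Bonferroni-adjusted upper-tail p-values for each spot's log ratio."""
    mode = find_mode(log_ratios, bin_size, window)
    sd = null_sd(log_ratios, mode)
    centered = [v - mode for v in log_ratios]  # peak centered on zero
    raw = [0.5 * math.erfc(c / (sd * math.sqrt(2))) for c in centered]
    n = len(raw)
    return [min(1.0, p * n) for p in raw]


# Hypothetical log ratios: a symmetric bulk plus one strongly bound spot.
values = [-0.09, -0.05, -0.01, 0.01, 0.05, 0.09] * 10 + [2.0]
adjusted = binding_p_values(values)
```

Fitting only the reflected lower half keeps the heavy upper tail, which is interpreted as specific binding, from inflating the estimated width of the nonspecific distribution.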

EXAMPLE

The appendices filed in U.S. Ser. No. 60/587,066 provide four actual examples of sequence sets for combinatorial arrays. Each set represents a complete design, either for an all 9-mer design (k=9) or an all 8-mer design (k=8). We can readily create all k-mer designs for any word (e.g. binding site) length k.

Arrays can be printed which include these sequence sets. Oligonucleotides for the arrays can further include an invariant sequence, such as a universal primer sequence 16 nt long, appended to the 5′ end of each of these sequences. The invariant sequence can be used for subsequent primer extension to create double-stranded DNAs from the single stranded oligonucleotide template. Examples of primer sequences that could be used include: TCAAGTCAATCGGTCC, ATCGCAGTTAGCAATG, and GGGTAGAGGGTTTCAA.

EXAMPLE

A nucleic acid array was prepared using a set of oligonucleotides that cover all 8-mers (k=8). The oligonucleotides have a total sequence length of 60. These sixty nucleotides include:

9 nucleotides as “linker” sequence that spaces the universal priming sequence from the slide surface (various sequences can be used; this example uses 9 T's);

24 nucleotides as the universal priming sequence;

27 nucleotides as the combinatorial sequence.

With this arrangement, all 8-mers were covered on one array by 3277 spots. The array was made double stranded using a primer. Synthesis of the second strand from the primer was confirmed on one array by incorporation of ALEXA®488-dUTP. Primer binding could also be confirmed using a Cy5-labelled primer (24 nucleotides in length).
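The spot count is consistent with a simple counting argument, sketched below. The derivation is our reading and is not stated in the text: it assumes that each 27-nucleotide combinatorial sequence contributes 27−8+1=20 overlapping 8-mers, that no 8-mer is repeated across the set, and that each of the 4⁸ sequences is counted once on the synthesized strand. The function names are hypothetical.

```python
def kmers_in(sequence, k):
    """List the overlapping k-mers (sliding windows of length k) in a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def min_molecules(k, combinatorial_length, alphabet_size=4):
    """Fewest molecules that can hold every k-mer if each window is a distinct k-mer."""
    total_kmers = alphabet_size ** k                # 4**8 = 65536 possible 8-mers
    per_molecule = combinatorial_length - k + 1     # 27 - 8 + 1 = 20 windows
    return -(-total_kmers // per_molecule)          # ceiling division


spots = min_molecules(k=8, combinatorial_length=27)
compaction_ratio = spots / 4 ** 8
```

Under these assumptions the lower bound equals the 3277 spots reported, and the resulting compaction ratio (unique molecules divided by represented k-mers) is about 0.05.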

Protein binding to an array of this design was evaluated using an ALEXA®488-labelled anti-GST antibody as follows. The yeast transcription factor CBF was expressed as a GST fusion protein and bound to the double-stranded array. Location of the protein was detected using the ALEXA®488-labelled anti-GST antibody. Imaging of the array indicated that only particular locations were bound by the CBF protein. An analysis of the ten brightest spots indicated the presence of binding sites that match well to the known CBF binding site.

EXAMPLE

We have developed a new DNA microarray-based in vitro technology, termed protein binding microarrays (PBMs), that allows rapid, high-throughput characterization of the DNA binding site sequence specificities of transcription factors in a single day. Using PBMs, we identified the DNA binding site sequence specificities of the yeast transcription factors Abf1, Rap1, and Mig1. Comparison of these proteins' in vitro binding sites versus their in vivo binding sites indicates that PBM-derived sequence specificities can accurately reflect in vivo DNA sequence specificities. In addition to previously identified targets, Abf1, Rap1, and Mig1 targeted about 100, 90, and 70 new target intergenic regions, respectively, many of which were upstream of previously uncharacterized open reading frames (ORFs). Comparative sequence analysis indicates that many of these newly identified sites are highly conserved across all five sequenced sensu stricto yeast species and thus are likely to be functional in vivo binding sites that potentially are utilized in a condition-specific manner. Similar PBM experiments will likely be useful in identifying cis regulatory elements and transcriptional regulatory networks in various genomes.

The ability of an organism to adapt to its environment, respond to extracellular cues, and perform vital cellular functions requires the specific and coordinated expression of thousands of genes. The interactions between transcription factors (TFs) and their DNA binding sites are an integral part of their transcriptional regulatory networks. These interactions control critical steps during normal growth and in response to external stimuli. For yeast, although the genomes of several different species have been sequenced and analyzed^(1,2), much still remains to be understood about how yeast genes are regulated. Significant progress has been made recently in the accumulation and analysis of mRNA transcript profiles^(3,4), locations of in vivo binding sites of TFs^(5,8), and protein-protein interactions⁹⁻¹². However, there are still many TFs whose DNA binding specificities and regulatory roles are unknown.

Protein binding microarray (PBM) technology allows the in vitro binding specificities of TFs to be determined in a single day by assaying the sequence-specific binding of TFs directly to double-stranded DNA microarrays spotted with a large number of potential DNA binding sites. A DNA binding protein of interest (e.g., a yeast TF) is expressed with an epitope tag, purified, and then bound directly to a double-stranded DNA microarray (here, a whole-genome yeast intergenic microarray). The protein-bound microarray is then washed gently to remove any nonspecifically bound protein, and labeled with a fluorophore-conjugated antibody that is specific for the epitope tag.

Binding site data from PBMs on yeast TFs corresponded well with binding site specificities determined from ChIP-chip. Moreover, comparative sequence analysis of the PBM-derived binding sites in S. cerevisiae with those in the orthologous positions in the sequence alignments of S. mikatae, S. kudriavzevii, S. bayanus, and S. paradoxus indicated that many of the sites bound in PBMs, including some not identified by ChIP-chip, are highly conserved and thus are likely to be functional in vivo binding sites that potentially are utilized in a condition-specific manner. Since there are many known TFs and predicted DNA binding proteins whose DNA binding specificities have not been characterized, our PBM technology will likely aid in the annotation of many regulatory proteins and DNA sequence motifs, and it may allow them to be connected to one another and to gene regulatory networks.

Results

Protein Binding Microarray Experiments

As a validation of this approach, we bound CBP-FLAG-Rpn4 fusion protein to microarrays spotted with a number of positive and negative control spots for binding by Rpn4. We labeled the protein-bound array with Cy3-conjugated M2 anti-FLAG primary antibody (Sigma), and scanned it with a microarray scanner (GSI Lumonics SCANARRAY™). Only the spots that contain good matches to the binding site motif for Rpn4 exhibit high signal intensity. As we previously found that higher signal intensity is in general indicative of higher DNA-protein binding affinity¹⁴, this CBP-FLAG-Rpn4 PBM indicates that our new PBM technology is successful in identifying sequence-specific TF binding.

We applied the PBM technology on a genome-wide scale by using whole-genome yeast intergenic arrays in PBM experiments to identify the sequence specificities and target genes of three yeast TFs: ARS-binding factor 1 (Abf1), repressor-activator protein 1 (Rap1), and Mig1. Abf1 has a CHC2 zinc finger DNA binding domain, binds origins of replication, and is a transcriptional silencer²⁰. Interestingly, binding of Abf1 induces a DNA bend of approximately 120°²¹. Rap1 binds to DNA via a Myb-like helix-turn-helix DNA binding domain and, among other roles, regulates telomere length and expression at the silent mating-type loci HML and HMR²². Mig1 has two C2H2 zinc fingers in its DNA binding domain, and it is involved in the repression of glucose-repressed genes²³.

The whole-genome yeast intergenic arrays were spotted with essentially all of the known intergenic regions in the yeast genome⁵; the lengths of the 6723 unique spotted PCR products ranged from ˜60 to ˜1500 bp, with an average length of ˜480 bp. We used Abf1, Rap1, and Mig1, dually-tagged at the N-terminus with glutathione S-transferase (GST) and His₆, in PBM experiments using these microarrays. The washed, protein-bound microarrays were labeled with Alexa 488-conjugated anti-GST antibody (see Methods), and then scanned with a microarray scanner. The microarray TIF images were quantified using GenePix Pro version 3.0 software (Axon Instruments, Inc.) (see Methods). In one case, a whole-genome yeast intergenic microarray was used in a PBM experiment with the yeast TF Rap1. Negative control PBMs performed on recombinant GST (Sigma) and GST-His₆ purified from yeast did not result in sequence specific DNA binding, indicating that there is no specific DNA binding conferred by these epitope tags. In addition, we performed negative control PBMs with calmodulin (Cmd1), which is not a DNA binding protein. As expected, the GST-His₆-Cmd1 PBMs also did not show evidence of sequence specific DNA binding. Thus, we expect sequence-specific binding in our PBM experiments only from sequence-specific DNA binding proteins. For each TF, the experiments were performed in triplicate. We found that the PBM data are highly reproducible, with most spots having SD/mean values less than 0.3. In general, the spots with the greatest variation are the ones with the lowest signal intensities, and thus were least likely to enter into our motif finding analysis.

In order to allow for the normalization of the PBM data by relative DNA concentration, we stained separate microarrays from the same microarray print run with SYBRGREEN™ I (Molecular Probes), which is specific for double-stranded DNA. Normalization by relative DNA concentration is important, since there can be significant variation in the amount of DNA present in each spot due to variation in the concentration of the DNAs that are spotted. We evaluated the distribution of the log ratios of mean PBM to mean SYBRGREEN™ I signal intensities for the set of triplicate Rap1 PBM experiments. We interpret the distribution of spots on the left, which is fit well by a Gaussian, to correspond to the spots that are bound nonspecifically. Conversely, we interpret the heavy upper tail of the distribution to correspond to spots bound specifically by the TF. For each spot we calculate a p-value for specific binding based on the number of standard deviations the spot's log ratio was above the mean of the Gaussian distribution, taking into account the width of that distribution (see Methods).

We applied the Modified Bonferroni Method to correct for multiple hypothesis testing in order to further increase our confidence in what we identify as ‘bound’ spots. We determined the numbers of unique spots that pass a 0.05, 0.01, or 0.001 p-value threshold for the PBM data of Abf1, Rap1, or Mig1. We have used a stringent p-value threshold of 0.001 even though we anticipate that it may increase our false negative rate, so as to increase the likelihood that spots passing our p-value threshold correspond to true positives. The spots with high log ratio PBM data correspond to the intergenic regions upstream of RPL14A, RPL8A, and OPI3, all of which are known to be direct targets of Rap1.

Identification of DNA Binding Site Motifs

We expected to be able to determine the DNA binding site specificity of a TF from analysis of the sequences with the highest PBM/SYBRGREEN™ I log ratios. We analyzed the sequences corresponding to those spots with a Bonferroni-corrected p-value of less than 0.001 with the motif discovery program BioProspector²⁴ in order to identify the TF's DNA binding site motif. Analysis of these sequences for over-represented DNA sequence motifs produced the Rap1, Abf1, and Mig1 binding site motifs. As can be seen, the PBM technology allows the identification of both ungapped (e.g. Rap1 and Mig1) and gapped (e.g. Abf1) DNA binding site motifs. Moreover, we were able to derive the binding site motif for Mig1 from the PBM data, whereas analysis of ChIP-chip data⁸ did not result in the Mig1 binding site motif. The group specificity score^(25,26) (see Methods) for each of the PBM motifs (p=1.1×10⁻²²² for Rap1; p=1.7×10⁻¹⁵¹ for Abf1; p=1.4×10⁻⁶⁵ for Mig1) was also evaluated. Compared to the corresponding computational negative controls on matched sets of randomly selected intergenic regions (see Methods), the group specificity scores derived from the PBM data for each of these three TFs are extremely significant. Thus, we have confidence that the PBM data represent true sequence-specific binding of TFs to their DNA binding sites.

The PBM data are based on an assay of all possible binding sites in the noncoding portion of the S. cerevisiae genome. To confirm and further explore the high resolution binding site data generated from PBMs, we performed electrophoretic mobility shift assays (EMSAs). EMSAs involving an intergenic region that was bound significantly by Rap1 in PBMs indicated the superiority of the PBM motif over the TRANSFAC™ motif in identifying high affinity TF binding sites. Specifically, the intergenic region iYPL221W contained a highly significant match (GGTGCACGGATT) to the Rap1 binding site motif and scored significantly in PBMs (p=3.9×10⁻²¹), but it was a poor match to the TRANSFAC™ Rap1 motif. Use of the TRANSFAC™ motif would predict the underlined nucleotides to be unfavorable for Rap1 binding, whereas these nucleotides are tolerated in the PBM motif. Our EMSA analysis of the sequence consisting of this Rap1 binding site, flanked by its native intergenic sequence, indicated that Rap1 is capable of high affinity binding to this sequence. This is an example of a high affinity TF binding site.

To approximate the potential false positive rate from PBMs, we determined the fraction of spots passing a 0.001 p-value threshold that were not identified by BIOPROSPECTOR™²⁴ as containing a sequence belonging to the given TF's binding site motif. This is by no means a perfect measure; for example, some of these potential false positives could simply have either weaker matches to the identified motif or perhaps have multiple occurrences of lower affinity sites that were not identified by the motif finder as belonging to the motif that represents the majority of the identified binding sites. For example, the intergenic region iYLL051C, which passed the 0.001 p-value threshold but had only a weak sequence match to the Rap1 PBM motif, was confirmed by EMSAs to be bound by Rap1 in vitro. This finding indicates that some intergenic regions may contain high affinity binding sites that are not strong sequence matches to the given TF binding site motif. Thus, it is possible that the false positive rate is lower than the one we have estimated. Nevertheless, using this approximate measure of false positives, we found that our false positive rates ranged from approximately 7% to 9% of ‘bound’ spots. Importantly, for all three TFs, a much larger number of potentially real binding sites was identified from the PBM data as compared to the number identified from ChIP-chip experiments.

Comparison of PBM Data Versus Chromatin Immunoprecipitation Data

DNA binding site motifs identified with the PBM technology for Abf1 and Rap1 corresponded well to motifs determined from analysis of ChIP-chip data passing a 10⁻³ p-value threshold for these same TFs^(7,8). The group specificity score^(25,26) for each of the ChIP-chip motifs was less significant than was the group specificity score for the respective PBM motifs, indicating that not all binding site sequence matches were identified as bound in the ChIP-chip experiments. Venn diagrams were used to compare the PBM data for Rap1, Abf1, and Mig1, respectively, with the ChIP-chip data for these same TFs^(7,8). Of the ˜6400, ˜6100, and ˜6400 unique intergenic PCR products that passed our various PBM data quality control filters for Rap1, Abf1, and Mig1, respectively, previously published ChIP-chip data^(7,8) were also available for 99.9%, 93.1%, and 93.7% of these intergenic regions. For both Rap1 and Abf1, the intergenic regions identified as bound by PBMs overlapped with a majority of the regions identified as bound by ChIP-chip, while certain intergenic regions were identified as bound either just by PBMs or just by ChIP-chip. Unlike for Rap1 and Abf1, the intergenic regions identified as bound by Mig1 in PBMs overlapped with only a small number of the regions identified as bound by ChIP-chip. Furthermore, many fewer regions were identified as bound in ChIP-chip as compared to PBMs. Since Mig1 is known to be regulated at the level of nuclear localization²³, it is possible that the yeast cultures for the ChIP-chip experiments were such that Mig1 may have been predominantly cytoplasmic. Thus, in addition to previously identified targets, we have identified 103, 86, and 69 new target intergenic regions for Abf1, Rap1, and Mig1, including those upstream of 25, 40, and 29 previously uncharacterized open reading frames (ORFs), respectively.

There are many biological reasons why certain intergenic regions would be bound either only in vitro or only in vivo. For example, a trans-acting factor adjacent to a TF binding site could prevent a TF from binding in vivo, or local chromatin structure could provide a non-permissive environment for binding. Conversely, a TF could indirectly interact with a particular intergenic region in vivo, in the absence of its own DNA binding site, through protein-protein interactions. In order to attempt to find evidence for such co-regulatory mechanisms, for each TF we searched each of these sets of intergenic regions for secondary DNA sequence motifs that might be involved in such co-regulation. A secondary motif specific for the regions bound only in vivo might correspond to a binding site for either a protein cofactor or a recruiting protein that allows the target TF to contact the DNA through an indirect interaction. Alternatively, a secondary motif specific for the regions bound only in vitro might correspond to a binding site for a separate factor that blocks access to these DNA binding sites under the conditions in which the yeast cultures were grown for the ChIP-chip experiments. We did not find any secondary motifs that achieved statistical significance, potentially because of the abundance of different modes by which binding of TFs to DNA is regulated in vivo. However, it is possible that such secondary motifs might exist for TFs not studied here.

Conservation of Identified Binding Sites Across All Five Sensu Stricto Yeast Genomes

To find evidence supporting our hypothesis that the regions bound only in vitro are functional in vivo but for some specific biological reason have not been identified previously as bound, we mapped the predicted binding sites in S. cerevisiae to the orthologous positions in the sequence alignments of the S. mikatae, S. kudriavzevii, S. bayanus, and S. paradoxus genomes, which are the four other sequenced yeast genomes of the yeast sensu stricto clade^(1,2). To examine binding site conservation across the five sensu stricto species, we considered binding site matches two different ways (see Methods). In the first approach, we called a site conserved if the orthologous sequence in all five species was within two standard deviations of the motif average²⁸ that we determined from the set of regions passing a 0.001 p-value threshold in PBMs. In the second approach, we employed a very strict measure of sequence conservation, in which we required 100% sequence identity at all 9, 9, or 12 nucleotide positions of the TF binding site (for Mig1, Abf1, and Rap1, respectively), in all five species. We found that the level of conservation varied for these three TFs, but that in general for the regions passing the 0.001 Bonferroni-adjusted p-value threshold the binding sites within regions bound in PBMs were just as likely to be conserved as were the binding sites within regions bound in ChIP-chip. Furthermore, the regions bound in PBMs and not in ChIP-chip showed approximately the same degree of conservation. PBM experiments identified 23, 70, and 38 binding sites for Mig1, Abf1, and Rap1, respectively, that were conserved within two standard deviations in all five species and that were not identified as ‘bound’ in ChIP-chip experiments^(7,8).
Moreover, the regions bound only in PBMs identified many conserved sites that are 100% identical across all five species. Based upon the known conservation level across the sensu stricto genomes², the probability of observing even a single binding site that is 100% conserved by chance is extremely small. However, for each TF between six and ten new sites were found that were 100% identical in all five species and bound only in PBMs. Furthermore, this is actually an underestimate of the degree of conservation among TF binding sites because our search was limited to genomic regions that are aligned in all five species and does not consider binding sites that may be conserved in four or fewer species. Thus, we believe that the intergenic regions bound in PBMs are very likely to contain functional in vivo binding sites.
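The strict conservation criterion (100% identity at every position of the site in all five species) can be illustrated with the sketch below. It assumes the site coordinates are given as columns of a multiple alignment and treats any gap in the site as non-conserved; the function name and the toy aligned sequences are invented for illustration and are not taken from the actual genome alignments.

```python
def site_is_identical(aligned_sequences, start, length):
    """True if the site columns [start, start+length) are identical and gap-free in every species."""
    sites = {seq[start:start + length].upper() for seq in aligned_sequences.values()}
    site = next(iter(sites))
    return len(sites) == 1 and len(site) == length and "-" not in site


# Toy alignment rows (invented); the 9-column site starts at column 2.
alignment = {
    "S. cerevisiae": "TTACCCCGGATCA",
    "S. paradoxus": "TAACCCCGGATCA",
    "S. mikatae": "TTACCCCGGATTA",
    "S. kudriavzevii": "CTACCCCGGATCA",
    "S. bayanus": "TTACCCCGGATCG",
}
conserved = site_is_identical(alignment, start=2, length=9)

# A single substitution inside the site in one species breaks strict conservation.
mutated = dict(alignment, **{"S. bayanus": "TTACCTCGGATCG"})
conserved_after_mutation = site_is_identical(mutated, start=2, length=9)
```

Differences in the flanking columns do not affect the result, since only the site columns are compared.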

Identification of Target Genes

We examined each of the sets of intergenic regions bound in PBMs to determine whether the candidate target genes, located directly downstream of the bound intergenic regions, were over-represented for particular functional groups of genes^(25,26) (see Methods). Among the significantly enriched categories for the target genes derived from the Rap1 PBM data, a large number are consistent with the known regulatory functions of Rap1²⁹, including the MIPS functional classification categories for ribosome biogenesis (p<1.0×10⁻¹⁴), protein synthesis (p<1.0×10⁻¹⁴), structural constituents of the ribosome (p<1.0×10⁻¹⁴), and cell growth and/or maintenance (p=3.5×10⁻¹²). Of note, the Rap1 PBM data identified a greater number of target genes that are annotated members of each of these functional categories than did the combination of the two published ChIP-chip datasets for Rap1. For example, PBM experiments identified eight genes involved in ribosome biogenesis, including two encoding mitochondrial ribosomal proteins, YmL8 and YmL15, whose putative TF binding sites were below the threshold for binding in both published ChIP-chip datasets^(7,8). Consistent with these functions, the deletion strains of the target genes derived from the Rap1 PBM data are enriched for having a slow growth phenotype (p=1.0×10⁻⁴). As further support of their functional significance, many of the corresponding enriched target genes from the PBM data had Rap1 sites that were conserved across all 5 sensu stricto yeast species. Among the newly identified Rap1 putative target genes, 40 were previously uncharacterized, including the ORFs YDR109C, YKL151C, YIL001W, and YKL082C.

Further characterization of these targets may identify heretofore unknown biological functions for Rap1, both expected and unexpected. YDR109C shows strong homology (BLAST E-value=2.0×10⁻⁹⁶) to a number of ribulose kinases, and YKL151C shows strong homology (BLAST E-value=1.0×10⁻⁴⁴) to a carbohydrate kinase family, suggesting a mechanism by which Rap1 might connect the nutrient status of a cell with its translational capacity. YIL001W shows strong homology (BLAST E-value=4.0×10⁻²⁶) to human elongation factor 1A binding protein, implying its likely role in protein synthesis. YKL082C, while uncharacterized, is thought to encode a nucleolar protein that is required for normal pre-rRNA processing and is involved in the establishment of cell polarity. Interestingly, gene expression of YKL082C clusters with that of several Rap1 targets identified in PBM and ChIP-chip experiments, including RPS27A (encodes a ribosomal protein), UBP10 (involved in telomeric silencing), as well as BUD22 and BUD27 (bud site selection)³⁰. BUD27 is also involved in gene expression controlled by the TOR kinase, which is known for its role in transducing the availability of nutrients into growth and ribosome synthesis. Importantly, all four of the above uncharacterized ORFs are downstream of Rap1 binding sites that are conserved across all five sensu stricto yeast species.

The significantly enriched categories for the target genes derived from the Abf1 PBM data are also consistent with the known regulatory functions of Abf1²⁹, including the GO biological process categories for cell growth and/or maintenance (p=1.6×10⁻⁷), cell organization and biogenesis (p=8.1×10⁻⁶), and essentiality (p=1.2×10⁻⁴). Consistent with these functions, the deletion strains of the target genes derived from the Abf1 PBM data are also enriched for having a slow growth phenotype (p=5.2×10⁻⁴). Among the Abf1 candidate target gene categories identified in this study that were not previously identified as targets by ChIP-chip analysis, there was an enrichment for the MIPS subcellular localization functional category of the mitochondrial outer membrane (p=6.8×10⁻⁴), the MIPS protein complex functional category for the mitochondrial translocase complex (p=6.3×10⁻⁵), and the GO Biological Process functional categories of nucleic acid metabolism (p=1.1×10⁻⁵) and protein metabolism (p=1.2×10⁻⁵). While Abf1 is known to be a global regulator, we could not find any prior evidence in the literature for its role in regulating proteins of the mitochondrial membrane or translocase complex. In all, we identified 25 novel uncharacterized putative target genes of Abf1 for which ChIP-chip experiments did not show enrichment, approximately half of which are downstream of Abf1 sites conserved across all five sensu stricto species.
Of note, YHR020W shows homology (WU-BLAST2 E-value=0.045) to Mst1, a mitochondrial threonine-tRNA synthetase, and to a Drosophila glutamyl-prolyl-tRNA synthetase (BLAST E-value=10⁻¹⁷²). YHR020W previously has been shown to be co-expressed with several other putative Abf1 targets involved in protein and nucleic acid metabolism, including URA7 (pyrimidine biosynthesis), ADE5,7 (purine biosynthesis), and SAM1 (adenosylmethionine synthetase)³⁰. Our results suggest that YHR020W is likely to be a true target of Abf1 and to play a role in cellular biogenesis. These observations are consistent with the known regulatory roles of Abf1 in cell growth and maintenance and underscore the ability of PBMs to identify novel, uncharacterized targets of even well-studied TFs.

A much more complete picture of the regulatory functions of Mig1 was possible from analysis of the PBM target genes than could be derived from the ChIP-chip data. Among the enriched functional categories were those for C-compound and carbohydrate metabolism (p=2.7×10⁻¹³), C-compound, carbohydrate transporters (p=6.7×10⁻⁸), hexose transport (p=9.5×10⁻⁹), and alcohol metabolism (p=2.2×10⁻⁵), all of which are consistent with the known regulatory function of Mig1 as a transcriptional repressor of genes whose products are dispensable in the presence of high levels of glucose²³. Many of the functional categories and corresponding target genes did not result from analysis of the ChIP-chip data, supporting our hypothesis that the Mig1 ChIP-chip experiments⁸ were not performed on cultures grown in ideal conditions for Mig1 to be regulating its target genes. We identified many novel putative target genes for Mig1, 29 of which were previously uncharacterized, including the ORFs YNR071C, YIL024C, YLR089C, YOR356W, and YLR072W.

Further examination of these uncharacterized putative Mig1 targets provided evidence suggestive of their roles in expected pathways and processes. YNR071C shows strong homology (BLAST E-value=5.3×10⁻⁸⁹) to Gal10, which has a key role in galactose metabolism, making this gene a plausible target for Mig1 repression in the presence of glucose. In addition, YNR071C is up-regulated 6.9-fold in a TOM6 deletion strain³¹; Tom6 is part of a protein complex involved in outer membrane mitochondrial translocation. These results suggest a potential relationship between YNR071C and Tom6 that couples galactose metabolism with aerobic respiration. Another new target, YIL024C, has homology, albeit low (WU-BLAST2 E-value=0.094), to Sip2, a member of a family of proteins that interact with Snf1 and Snf4 and that are involved in the response to glucose starvation³². Notably, the predicted Mig1 binding site upstream of YIL024C is conserved (2 SDs position weight matrix (PWM) match) across all 5 sensu stricto yeast species. Two more novel targets, YLR089C and YOR356W, both encode proteins that are localized to the mitochondria³³ and are likely to be important in the oxidation of fuel molecules. YLR089C shows homology (WU-BLAST2 E-value=2.6×10⁻⁶) to Bna3, which is involved in NAD biosynthesis, and to alanine aminotransferases in species ranging from plants (BLAST E-value=10⁻²²²) to human (BLAST E-value=10⁻¹¹⁶). Alanine aminotransferases are pyridoxal enzymes that catalyze the reversible transamination between alanine and 2-oxoglutarate to form pyruvate and glutamate. By mediating the conversion of these four major intermediate metabolites, these transaminases have roles in gluconeogenesis and in amino acid metabolism. YOR356W shows strong homology (BLAST E-value=10⁻¹⁵⁹) to a human electron transfer flavoprotein-ubiquinone oxidoreductase. Both YLR089C and YOR356W are immediately downstream of Mig1 binding sites that are 100% identical at all nine positions across all five sensu stricto species.
Finally, in addition to the role of Mig1 in regulating carbohydrate and amino acid metabolism, our results also indicate a possible role in cholesterol biosynthesis. Mig1 shows homology (BLAST E-value=7×10⁻¹⁰) to the human zinc finger TF WT1, a tumor suppressor gene mutated in Wilms' tumor. WT1 has been implicated in repression of the mevalonate pathway, which is central in cholesterol biosynthesis³⁴. Similarly, YLR072W shows homology (WU-BLAST2 E-value=4.1×10⁻⁴) to Atg26, a sterol 3-beta glucosyl transferase involved in sterol metabolism.

In addition to examining the features of individual putative target genes for each TF, we investigated whether the collective group of target genes showed concerted expression in particular experimental conditions. Because different culture conditions often stimulate different cellular responses and coordinate changes in transcriptional regulation, the success of ChIP-chip experiments hinges on choosing those conditions for which the TF is expressed and active. PBMs, however, are free of this constraint and can identify DNA binding site motifs and putative target genes irrespective of culture conditions. Analysis of these putative target genes can further be used to suggest in vivo conditions for optimal TF activity. Datasets from 643 yeast gene expression microarray experiments performed under various conditions were examined to identify particular conditions in which a significant fraction of PBM target genes were differentially expressed (see Methods). Not surprisingly, the largest fraction of Rap1 targets was up-regulated in conditions optimized for ribosomal biogenesis; e.g., those that confer minimal stress, and those that do not correspond to gene deletions³⁵. Furthermore, significant fractions of the Rap1 target genes were down-regulated in a variety of stress response conditions, which are known to involve the repression of ribosomal proteins and protein synthesis³⁵. Abf1 target genes show no significant enrichment or deficit in expression in the examined set of conditions, possibly indicating that the optimal conditions for Abf1 binding have not yet been profiled with microarrays. Alternatively, Abf1 target genes may be coregulated by a diverse set of condition-dependent factors, such that no single condition leads to broad correlated expression of the target genes. Mig1 target genes identified by PBM analysis were significantly down-regulated in expression experiments in which the preferred sugars glucose, fructose, and sucrose were the primary carbon sources.
This isin agreement with the known role of Mig1 as a repressor of genesinvolved in alternate carbon source metabolism²³. Several cultureconditions were identified in which Mig1 target genes were up-regulated,despite its role in glucose repression, indicating the absence orinactivation of Mig1. Indeed, many of the most significant conditions(e.g., those in which a significant fraction of PBM target genes weredifferentially expressed) involved stationary phase cultures; stationaryphase onset has been associated with derepression by Mig1³⁶. Takentogether, these results show that in conjunction with expressionprofiling, PBM analysis has the potential to provide insight into thefunctions of particular TFs and to identify conditions in which they areactive in vivo.

This PBM technology allows rapid, high-throughput characterization of the DNA binding site sequence specificities of TFs in a single day, and thus can connect TFs to the genes they regulate. In addition to identifying enriched functional categories of known and newly discovered target genes, we also identified many uncharacterized ORFs as candidate target genes of Rap1, Abf1, and Mig1. Moreover, the Mig1 data highlight one scenario in which PBM data may be particularly valuable; i.e., ChIP-chip experiments require that the cells be maintained in culture conditions in which the TF of interest is expressed and nuclear. As could be seen for Mig1, PBM experiments will be particularly useful when ChIP-chip experiments do not result in enough enrichment of bound fragments in the immunoprecipitated sample to permit identification of the DNA sites bound in vivo. Moreover, integrating an epitope tag on the genomic copy of the TF, which allowed the use of a single antibody in the 106 ChIP-chip experiments performed by Lee et al.⁸, is not as trivial in many other organisms as it is in yeast; instead, in general one has to rely upon protein-specific antibodies that are both specific and successful in chromatin immunoprecipitation, and the generation of such antibodies is not a trivial undertaking.

It is possible that the PBM and ChIP-chip binding site motifs will not correspond so closely for all proteins. Such differences may help us to identify whether there are significant in vivo effects due to either chromatin structure or cofactors important for sequence-specific binding. Even though the DNA in PBM experiments is not in the same state as it might be if it were to be bound by the TF in vivo, results from PBM experiments can provide valuable data on the sequence specificity of TFs, particularly those which have been poorly understood or uncharacterized thus far. Performing ChIP-chip experiments on yeast grown under a variety of different culture conditions will help to confirm our predictions that particular sets of newly identified binding sites are indeed bound in vivo. Furthermore, the combination of PBM data with mRNA expression data^(3,4), ChIP-chip⁵⁻⁸, protein-protein interaction data^(9,10,12) and prior genetic and biochemical data in the literature will contribute towards more detailed models of and a more thorough understanding of gene regulatory networks in yeast³⁷.

The data presented here indicate that the PBM approach works for TFs with DNA binding domains of a number of different structural classes, including those whose binding induces DNA bending²¹. Since PBM experiments are highly scalable, they could readily be adapted for the analysis of all possible DNA sequence variants. Similarly, there are hundreds of predicted DNA binding proteins in yeast and thousands of predicted TFs in other genomes that could be screened for sequence-specific binding by PBM experiments. For example, a microarray consisting of roughly 30,000 spots of 1 kb sequence, enriched for the portions of the Drosophila melanogaster or human genome likely to contain regulatory elements, could be examined by PBMs to characterize the DNA binding specificities of over 670 D. melanogaster TFs³⁸ or of the ~2000 human TFs^(39,40). Since dozens of PBM experiments could be performed in parallel in a single day, this technology provides significant cost and time advantages over other methods, which can take months to measure the effects of mutations for a large set of variant DNA-protein interactions.

The effects of different concentrations of TFs as well as of protein cofactors, protein modifications, small molecule cofactors such as metabolites, or various binding conditions could be measured with PBMs. PBMs could be used to distinguish the relative binding preferences of various whole or partially fractionated cell lysates, such as from various cell types, sampled at different time points or grown under different conditions.

Bioinformatic analysis of PBM data will provide more informative data than a mononucleotide PWM, as it has been shown previously that nucleotides of TF binding sites frequently do not act independently in binding by TFs^(15-17,41). Moreover, as more PBM experiments are performed, the vast datasets that would be generated on DNA-protein interactions could yield the data required to determine what predictive rules may exist that describe DNA recognition by sequence-specific TFs⁴².

Methods

Synthesis of DNA Microarrays

Microarrays spotted with double-stranded DNAs containing either positive or negative control binding sites for Rpn4 for the PBM proof-of-principle experiments with CBP-FLAG-Rpn4 were synthesized essentially as described previously¹⁴. Whole-genome yeast intergenic microarrays were synthesized essentially as described previously⁵.

Expression and Purification of Yeast Transcription Factors

N-terminal CBP-FLAG fusions of RPN4 were created by cloning RPN4 into the pCAL-n-FLAG™ vector (Stratagene). The resulting CBP-FLAG-RPN4 fusion constructs were full-length sequence-verified to ensure that no mutations had been introduced during cloning. Verified CBP-FLAG-RPN4 constructs were transformed into BL21-GOLD™(DE3)pLysS E. coli (Stratagene) and expressed by inoculating LB medium containing 50 μM zinc acetate and 50 μg/ml carbenicillin with an overnight culture (1:20 dilution), growing at 30° C. to an OD₆₀₀ between 0.3 and 0.5, and then inducing with 1 mM IPTG until an OD₆₀₀ of 1.6 was reached. Cell pellets were stored at −80° C., and subsequently were thawed on ice and lysed with CELLYTIC B™ Bacterial Cell Lysis Extraction Reagent (Sigma) containing 50 μM zinc acetate. The CBP-FLAG fusion proteins were purified with anti-FLAG M2 affinity gel (Sigma), and subsequently quantified. Sequence-specific binding of the purified Rpn4 fusion protein was verified with EMSAs using probes containing the consensus PACE site²⁶. Purified proteins were stored at −80° C. until use.

N-terminal GST-His₆ fusions of Rap1, Abf1, and Mig1 were produced essentially as described previously¹¹. Briefly, the fusion proteins were expressed in S. cerevisiae, and then individually purified with glutathione beads (Amersham), concentrated using Microcon YM-30™ filters (Millipore), and subsequently quantified. Purified proteins were stored at −80° C. until use.

Protein Binding Microarray (PBM) Experiments

PBM experiments and SYBRGREEN™ I staining of the DNA microarrays were performed in triplicate, essentially as described previously¹⁴. Briefly, previously purified proteins were thawed on ice and diluted to a 20 nM final concentration in a protein binding reaction mixture consisting of PBS, 50 μM zinc acetate (ZnAc), 2% (w/v) nonfat dried milk, 0.3 ng/μl salmon testes DNA (Sigma), and 0.2 μg/μl BSA; this protein binding reaction was allowed to pre-incubate for 1 hr at room temperature. Microarrays were pre-wet in PBS/0.01% Triton X-100 and then blocked with 2% milk in PBS for 1 hr. The blocked microarrays were washed once with PBS/0.1% Tween 20, and then once with PBS/50 μM ZnAc/0.01% Triton X-100 (PBS/ZnAc/TX100). The pre-incubated protein binding mixtures were then applied to the microarrays and binding was allowed to proceed for 1 hr. The microarrays then were washed once with PBS/ZnAc/0.5% Tween 20, and then once with PBS/ZnAc/TX100. Alexa 488-conjugated rabbit anti(GST) polyclonal antibody (Molecular Probes) or Cy3-conjugated mouse anti(FLAG) M2 monoclonal antibody (Sigma) was diluted in PBS/ZnAc containing 2% milk, pre-incubated for at least 30 min, and applied to the microarray. After incubation for 1 hr, the microarrays were washed 3 times with PBS/ZnAc/0.05% Tween 20, and once with PBS/ZnAc. The slides were then spun dry, and stored in a closed box until being scanned.

Microarray Imaging and Data Analysis

All whole-genome yeast intergenic microarrays were from the same print run, so as to minimize variation. We typically scanned (GSI Lumonics SCANARRAY™ 4000 or SCANARRAY™ 5000) the labeled, protein-bound microarrays and the SYBRGREEN™ I stained microarrays at 3-6 different laser power intensities or PMT gain settings per microarray; this allowed us to capture signal intensities for even very low signal intensity spots, while ensuring that we captured sub-saturation signal intensities for each of the spots on the microarray¹⁴. Microarrays were scanned using appropriate lasers and filter sets, essentially as described previously¹⁴.

Microarray TIF images were quantified using GenePix Pro version 3.0 software (Axon Instruments, Inc.). Specifically, background-subtracted median intensities were calculated using the median local background. We used MASLINER™ (MicroArray Spot LINEar Regression) software to calculate the relative signal intensities over the full series of laser power (or PMT gain) setting scans in a semi-automated fashion. Specifically, MASLINER™ combines the linear ranges of multiple scans from different scanner sensitivity settings onto an extended linear scale^(14,43). As a result, the fluorescence intensities of the final PBM and SYBRGREEN™ I stained microarrays spanned 5 to 6 orders of magnitude.

The resulting microarray data were filtered with a number of quality control criteria so that only data from high quality spots were retained. First, for each of the triplicate microarrays, we removed data corresponding to any flagged spots (e.g., spots that had dust flecks, etc.). Data from each of the three triplicate microarrays were normalized according to total signal intensity, so that the average spot intensity was the same for all three slides. Then, within each individual slide, the data were separated into sectors, according to their local region on the slide; for the whole-genome yeast intergenic arrays we sectored the spots into the 32 subgrids of the printed microarray. The data were then normalized again so that the mean spot intensity was the same over all the sectors; this served to normalize for any region-specific inhomogeneities in the background and also in the binding and labeling reactions. Any spots with SD/median greater than 2, i.e., spots with highly variable pixel signal intensities, were filtered out. The background-subtracted, normalized signal intensities for all spots with reliable data in at least two of the three replicate microarrays were averaged, and the SD/mean was calculated. The SYBRGREEN™ I microarray data were treated exactly the same way, except that any spots with fewer than 50% of pixels with signal intensities greater than two standard deviations beyond the median background signal intensity were also filtered out, as these spots presumably do not have enough DNA present to allow accurate quantification of signal intensities. For the Rap1, Abf1, and Mig1 PBM datasets, ~91-96% of 6723 unique spots passed these criteria.
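The filtering and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not the software used in the experiments; the function names and the representation of a slide as a dictionary mapping a spot identifier to a background-subtracted intensity are assumptions made for the example.

```python
from statistics import mean

def passes_pixel_filter(sd, median, max_ratio=2.0):
    """Keep a spot only if its pixel SD/median does not exceed max_ratio."""
    return median > 0 and sd / median <= max_ratio

def normalize_to_common_mean(slides):
    """Rescale each slide (dict: spot id -> intensity) to a shared mean intensity."""
    grand = mean(mean(s.values()) for s in slides)
    return [{spot: v * grand / mean(s.values()) for spot, v in s.items()}
            for s in slides]

def average_replicates(slides, min_replicates=2):
    """Average each spot over the slides that retained it, requiring
    reliable data in at least min_replicates slides."""
    averaged = {}
    for spot in set().union(*slides):
        values = [s[spot] for s in slides if spot in s]
        if len(values) >= min_replicates:
            averaged[spot] = mean(values)
    return averaged
```

The same rescaling function could in principle be applied a second time to the sectors of a single slide, treating each sector as its own dictionary, to mirror the sector-level normalization.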

We calculated the fractional signal intensity of each spot, relative to the total signal intensity on the microarray. We then calculated the log₂ of the ratio of the mean PBM signal intensity to the mean SYBRGREEN™ I signal intensity, and created a scatter plot of the log ratio versus the spots' SYBRGREEN™ I signal intensities. Although we expect that the log ratio should be independent of DNA concentration, we have found that higher DNA concentrations, as determined by higher SYBRGREEN™ I signal intensities, appear to bind proportionately less protein. In order to restore the independence of log ratio and SYBRGREEN™ I intensity, the scatter plot was fit with a locally weighted least squares regression using the LOWESS function⁴⁴ of the R statistics package⁴⁵ (smoothing parameter=0.5). We subtracted the value of the regression at each spot from its log ratio, yielding a modified log ratio that is independent of DNA concentration. We then plotted the distribution of all log ratios as a histogram (bin size=0.05), which for Rap1, Abf1, and Mig1 resembled a Gaussian distribution with a heavy tail. We determined the mode of the distribution by searching for the window of nine bins with the highest number of spots and taking the middle bin. We then reflected all values less than the mode and fit these values to a Gaussian function using the MATHEMATICA™ software package (Wolfram Research, Inc.). This gave the mean and standard deviation of the distribution, and the mean was used to adjust the log ratios so that the peak was centered on zero. We calculated a p-value for each individual spot based on the magnitude of its log ratio relative to the standard deviation of the Gaussian distribution, using the normal error integral. In order to correct for multiple hypothesis testing, all individual p-values were adjusted to a modified significance level using the Modified Bonferroni Method^(15,46). For significance testing of the PBM data, we used an initial α=0.001, which corresponded to α′ equal to approximately 1.5×10⁻⁷ for the highest-ranking test case, as we were typically evaluating ~6400 unique spots.
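The mode search, reflection, p-value, and significance-level steps can be illustrated with the following Python sketch. It is a simplified stand-in under stated assumptions: the LOWESS correction is omitted, the Gaussian width is estimated from the second moment of the reflected values rather than by a curve fit in MATHEMATICA™, the p-value is taken as a one-sided upper tail, and all function names are hypothetical.

```python
import math
from collections import Counter

def find_mode(log_ratios, bin_size=0.05, window=9):
    """Centre of the middle bin of the `window` consecutive bins with the most spots."""
    counts = Counter(math.floor(x / bin_size) for x in log_ratios)
    lo, hi = min(counts), max(counts)
    best_start = max(
        range(lo, max(lo, hi - window + 1) + 1),
        key=lambda s: sum(counts.get(s + i, 0) for i in range(window)),
    )
    return (best_start + window // 2 + 0.5) * bin_size

def reflected_sigma(log_ratios, mode):
    """SD of the symmetric distribution obtained by mirroring values below the mode."""
    below = [x for x in log_ratios if x < mode]
    return math.sqrt(sum((x - mode) ** 2 for x in below) / len(below))

def upper_tail_p(log_ratio, mode, sigma):
    """One-sided normal tail probability (normal error integral) for one spot."""
    z = (log_ratio - mode) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))

def bonferroni_alpha(alpha, n_tests):
    """Significance level applied to the highest-ranking test case."""
    return alpha / n_tests
```

With α=0.001 and 6400 spots, `bonferroni_alpha(0.001, 6400)` evaluates to 1.5625×10⁻⁷, consistent with the value of approximately 1.5×10⁻⁷ quoted above.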

DNA Motif Finding and Group Specificity Score

For analysis of sequences for over-represented DNA sequence motifs, we used BIOPROSPECTOR™²⁴. We chose BIOPROSPECTOR™ over other available motif finding programs because it proved to be the most inclusive in accepting the largest number of input sequences in construction of the TF binding site motifs. To search for motifs that were over-represented in PBM experiments, we used all sequences from spots that had a Bonferroni-corrected p-value less than or equal to 0.001 as input. To search for motifs that were over-represented in the intergenic regions bound in ChIP-chip experiments, we input either all sequences with a p-value less than or equal to 0.001⁸ or all sequences with a median percentile rank at or above 0.92 in the six replicate experiments and below 0.92 in controls⁷. For each set of input sequences, we performed separate searches at each width between 6 and 18 nucleotides in order to identify the highest scoring motifs at each width. We chose the single motif with the best (i.e., numerically smallest) group specificity score²⁶ to be the most significant, using the set of all sequences spotted on the microarray as the background. Briefly, the group specificity score indicates the degree to which the property of containing the sequence motif is specific to the input set of intergenic regions, as determined from the most significantly bound spots on the microarrays, with a smaller group specificity score indicating that the motif is more specific to the input set of spots (e.g., either the spots beyond a 0.001 p-value threshold in either the PBM or Lee et al.⁸ ChIP-chip data, or the spots at or beyond the 92^(nd) percentile rank in the Lieb et al.⁷ ChIP-chip data, or the randomly selected spots in the computational random controls). In order to assess the statistical significance of the DNA sequence motifs resulting from analysis of the PBM experiments, we performed a set of computational negative control experiments. In these computational negative controls, we performed identical motif searches on 10 individual sets of randomly selected spots from the yeast intergenic microarrays for each TF, with each random set containing the same number of sequences as the original input sets for each of the Rap1, Abf1, and Mig1 PBM datasets. The range of group specificity scores for the Rap1 control sets was 2.2×10⁻⁵ to 3.5×10⁻⁵, with a geometric mean equal to 8.4×10⁻⁸; the range of group specificity scores for the Abf1 control sets was 5.6×10⁻³ to 1.3×10⁻⁶, with a geometric mean equal to 3.7×10⁻⁵; and the range of group specificity scores for the Mig1 control sets was 4.8×10⁻³ to 1.8×10⁻⁵, with a geometric mean equal to 4.8×10⁻⁴. Thus, the Rap1, Abf1, and Mig1 motifs identified from the intergenic regions identified as bound in PBM experiments had highly significant group specificity scores as compared to the random controls. We also determined the motifs' Pearson correlation coefficients using CompareACE²⁶. The correlation coefficients were as follows: Rap1 PBM versus Lee et al.⁸ ChIP-chip: 0.992; Rap1 PBM versus Lieb et al.⁷ ChIP-chip: 0.995; Rap1 PBM versus TRANSFAC: 0.953; Rap1 Lee et al.⁸ versus Lieb et al.⁷ ChIP-chip: 0.985; Rap1 Lee et al.⁸ ChIP-chip versus TRANSFAC: 0.921; Rap1 Lieb et al.⁷ ChIP-chip versus TRANSFAC: 0.950; Abf1 PBM versus ChIP-chip⁸: 0.989; Abf1 PBM versus TRANSFAC: 0.978; Abf1 ChIP-chip⁸ versus TRANSFAC: 0.986; Mig1 PBM versus ChIP-chip⁸: 0.453; Mig1 PBM versus TRANSFAC: 0.938; Mig1 ChIP-chip⁸ versus TRANSFAC: 0.406.

Electrophoretic Mobility Shift Assays (EMSAs)

EMSAs were performed according to manufacturer's protocols for the LIGHTSHIFT® Chemiluminescent EMSA Kit (Pierce). Complementary biotinylated DNA oligonucleotides, each 45 bp in length, were synthesized (Integrated DNA Technologies) such that they contained the predicted Rap1 binding site, flanked by its native sequence from the given intergenic region. A positive control probe containing a known Rap1 binding site and a negative control probe lacking a Rap1 binding site were also synthesized and used in EMSAs.

Analysis of Functional Category Enrichment

Analysis of a group of genes for enrichment for a particular functional annotation previously has been used to analyze sets of yeast genes that comprise particular gene expression clusters²⁵. We used the web-based tool FunSpec for the statistical evaluation of the groups of genes downstream of the ‘bound’ intergenic regions, for groups of over-represented gene and protein categories with respect to existing functional category information from a number of public and published databases⁴⁷. Like the group specificity score described above^(25,26), FunSpec uses the hypergeometric distribution to calculate a p-value for functional category enrichment^(25,26).
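For orientation, a hypergeometric enrichment p-value of the kind referred to above can be computed as follows. This is a generic sketch of the upper-tail hypergeometric probability, not the FunSpec implementation; the function and parameter names are chosen for illustration.

```python
from math import comb

def hypergeom_enrichment_p(population, annotated, sample, hits):
    """Probability of observing at least `hits` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` belong to the functional category."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, sample)
```

For example, drawing 2 genes from a population of 10 in which 5 are annotated, the chance that both are annotated is C(5,2)/C(10,2) = 10/45.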

Analysis of Cross-Species Sequence Conservation

We searched for conserved putative binding sites in the five sequenced genomes of the yeast sensu stricto clade: S. cerevisiae, S. mikatae, S. kudriavzevii, S. bayanus, and S. paradoxus. Our searches were limited to the aligned regions in the MULTIZ™ multiple sequence alignment. Regions aligned between S. cerevisiae and each of the other four species were separately mapped onto the S. cerevisiae chromosomal coordinates. We used ScanACE²⁶ to search all five genomes for sequence matches within two standard deviations of the motif identified from PBM experiments. In our first approach, a site was called conserved if its ScanACE score was within two standard deviations of the motif average²⁸ that we determined from the set of regions passing a 0.001 p-value threshold in PBMs and if its relative position in each genome differed by no more than 15 bp. In our second approach, a site was called exactly conserved if it satisfied the previous conditions and was identical in all five species at each of the informative positions. Here, for “exact” conservation we employed a very strict measure of sequence conservation, in which we required 100% sequence identity at all informative nucleotide positions of the binding site (9, 9, or 12 positions for Mig1, Abf1, and Rap1, respectively), in all five species. We defined informative positions to be those with an information content of greater than 0.5 bits in the PBM-derived motif. Analyses were performed with Perl scripts written by M.F.B.
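The notion of informative positions and exact conservation can be made concrete with a short Python sketch. It assumes a uniform background over the four nucleotides and no small-sample correction when computing information content, and it represents the motif as a list of per-position base frequencies; these choices and the function names are assumptions for illustration and do not reproduce the Perl scripts mentioned above.

```python
import math

def information_content(freqs):
    """Information (bits) at one motif position from its four base frequencies."""
    return 2.0 + sum(p * math.log2(p) for p in freqs if p > 0)

def informative_positions(motif, threshold=0.5):
    """Indices of motif positions whose information content exceeds the threshold."""
    return [i for i, col in enumerate(motif) if information_content(col) > threshold]

def exactly_conserved(aligned_sites, positions):
    """True if every species carries the same base at each informative position."""
    return all(len({site[i] for site in aligned_sites}) == 1 for i in positions)
```

A position with a single invariant base carries 2 bits, an evenly split two-base position carries 1 bit, and a position with all four bases equally likely carries 0 bits and would not count as informative.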

Analysis of Correlation of Target Genes with Gene Expression Data

Gene expression data from 643 yeast expression microarray experiments across a variety of culture conditions⁴⁸ were normalized so that the fold-change within each microarray had a mean of 0 and standard deviation of 1. In order to identify conditions under which a particular TF either activated or failed to repress transcription, we calculated the fraction of putative target ORFs with at least a 2.5-fold increase in gene expression for each individual condition. Similarly, to find conditions in which a TF acted as a repressor or failed to activate transcription, we calculated the fraction of putative target ORFs with at least a 2.5-fold decrease in gene expression for each condition. Significance was assessed by comparison with 100 sets of randomized ORFs, matched in size to the lists of target genes for each TF.
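The comparison against randomized ORF sets can be sketched as follows in Python. The sketch is illustrative only: how the 2.5-fold cutoff maps onto the normalized values is not specified above, so the threshold is left as a parameter, and the function names, the dictionary representation of one condition (ORF name to normalized value), and the fixed random seed are all assumptions.

```python
import random
from statistics import mean, pstdev

def zscore(values):
    """Normalize one microarray's values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def fraction_beyond(expr, genes, threshold):
    """Fraction of the given ORFs whose value is at or above the threshold."""
    return sum(expr[g] >= threshold for g in genes) / len(genes)

def empirical_p(expr, targets, threshold, n_random=100, seed=0):
    """Observed fraction for the target ORFs, and the share of size-matched
    random ORF sets that reach or exceed that fraction."""
    rng = random.Random(seed)
    observed = fraction_beyond(expr, targets, threshold)
    all_orfs = sorted(expr)
    as_extreme = sum(
        fraction_beyond(expr, rng.sample(all_orfs, len(targets)), threshold) >= observed
        for _ in range(n_random)
    )
    return observed, as_extreme / n_random
```

Down-regulation could be examined with the same functions by negating the expression values before calling `empirical_p`.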

REFERENCES

-   1. Kellis, M., Patterson, N., Endrizzi, M., Birren, B. & Lander, E. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423, 241-254 (2003).
-   2. Cliften, P. et al. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301, 71-76 (2003).
-   3. Wodicka, L., Dong, H., Mittmann, M., Ho, M. H. & Lockhart, D. J. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nat. Biotechnol. 15, 1359-1367 (1997).
-   4. DeRisi, J. L., Iyer, V. R. & Brown, P. O. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680-686 (1997).
-   5. Ren, B. et al. Genome-wide location and function of DNA binding proteins. Science 290, 2306-2309 (2000).
-   6. Iyer, V. R. et al. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409, 533-538 (2001).
-   7. Lieb, J. D., Liu, X., Botstein, D. & Brown, P. O. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet. 28, 327-334 (2001).
-   8. Lee, T. et al. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799-804 (2002).
-   9. Ito, T. et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA 97, 1143-1147 (2000).
-   10. Uetz, P. et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403, 623-627 (2000).
-   11. Zhu, H. et al. Global analysis of protein activities using proteome chips. Science 293, 2101-2105 (2001).
-   12. Ho, Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180-183 (2002).
-   13. Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol. 9, 2944-2949 (1989).
-   14. Bulyk, M. L., Huang, X., Choo, Y. & Church, G. M. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl. Acad. Sci. USA 98, 7158-7163 (2001).
-   15. Bulyk, M., Johnson, P. & Church, G. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 30, 1255-1261 (2002).
-   16. Benos, P., Bulyk, M. & Stormo, G. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 30, 4442-4451 (2002).
-   17. Lee, M.-L., Bulyk, M., Whitmore, G. & Church, G. A statistical model for investigating binding probabilities of DNA nucleotide sequences using microarrays. Biometrics 58, 981-988 (2002).
-   18. Udalova, I., Mott, R., Field, D. & Kwiatkowski, D. Quantitative prediction of NF-kappa B DNA-protein interactions. Proc. Natl. Acad. Sci. USA 99, 8167-8172 (2002).
-   19. Schena, M., Shalon, D., Davis, R. W. & Brown, P. O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470 (1995).
-   20. Diffley, J. & Stillman, B. Purification of a yeast protein that binds to origins of DNA replication and a transcriptional silencer. Proc. Natl. Acad. Sci. USA 85, 2120-2124 (1988).
-   21. McBroom, L. D. & Sadowski, P. D. DNA bending by Saccharomyces cerevisiae ABF1 and its proteolytic fragments. J. Biol. Chem. 269, 16461-16468 (1994).
-   22. Konig, P., Giraldo, R., Chapman, L. & Rhodes, D. The crystal structure of the DNA-binding domain of yeast RAP1 in complex with telomeric DNA. Cell 85, 125-136 (1996).
-   23. Lutfiyya, L. L. et al. Characterization of three related glucose repressors and genes they regulate in Saccharomyces cerevisiae. Genetics 150, 1377-1391 (1998).
-   24. Liu, X., Brutlag, D. & Liu, J. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 127-138 (2001).
-   25. Tavazoie, S., Hughes, J., Campbell, M., Cho, R. & Church, G. Systematic determination of genetic network architecture. Nat. Genet. 22, 281-285 (1999).
-   26. Hughes, J. D., Estep, P. W., Tavazoie, S. & Church, G. M. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205-1214 (2000).
-   27. Wingender, E. et al. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28, 316-319 (2000).
-   28. Robison, K., McGuire, A. M. & Church, G. M. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol. 284, 241-254 (1998).
-   29. Planta, R. J. Regulation of ribosome synthesis in yeast. Yeast 13, 1505-18 (1997).
-   30. Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185-98 (2004).
-   31. Hughes, T. R. et al. Functional discovery via a compendium of expression profiles. Cell 102, 109-26 (2000).
-   32. Jiang, R. & Carlson, M. The Snf1 protein kinase and its activating subunit, Snf4, interact with distinct domains of the Sip1/Sip2/Gal83 component in the kinase complex. Mol. Cell. Biol. 17, 2099-106 (1997).
-   33. Palecek, S. P., Parikh, A. S., Huh, J. H. & Kron, S. J. Depression of Saccharomyces cerevisiae invasive growth on non-glucose carbon sources requires the Snf1 kinase. Mol. Microbiol. 45, 453-69 (2002).
-   34. Rae, F. K. et al. Analysis of complementary expression profiles following WT1 induction versus repression reveals the cholesterol/fatty acid synthetic pathways as a possible major target of WT1. Oncogene 23, 3067-79 (2004).
-   35. Gasch, A. P. et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11, 4241-57 (2000).
-   36. Unnikrishnan, I., Miller, S., Meinke, M. & LaPorte, D. C. Multiple positive and negative elements involved in the regulation of expression of GSY1 in Saccharomyces cerevisiae. J Biol Chem 278, 26450-7 (2003).
-   37. Hartemink, A., Gifford, D., Jaakkola, T. & Young, R. Combining location and expression data for principled discovery of genetic regulatory network models. Pac. Symp. Biocomput., 437-449 (2002).
-   38. Adams, M. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185-2195 (2000).
-   39. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304-1351 (2001).
-   40. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
-   41. Man, T. K. & Stormo, G. D. Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 29, 2471-2478 (2001).
-   42. Desjarlais, J. R. & Berg, J. M. Toward rules relating zinc finger protein sequences and DNA binding site preferences. Proc. Natl. Acad. Sci. USA 89, 7345-7349 (1992).
-   43. Dudley, A., Aach, J., Steffen, M. & Church, G. Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc. Natl. Acad. Sci. USA 99, 7554-7559 (2002).
-   44. Cleveland, W. & Devlin, S. Locally weighted regression: An approach to regression analysis by local fitting. J. American Statistical Association 83, 596-610 (1988).
-   45. Ihaka, R. & Gentleman, R. R: A language for data analysis and graphics. J. Computational and Graphical Statistics 5, 299-314 (1996).
-   46. Sokal, R. & Rohlf, R. Biometry: The Principles and Practice of Statistics in Biological Research, (W. H. Freeman and Company, New York, 1995).
-   47. Robinson, M., Grigull, J., Mohammad, N. & Hughes, T. FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics 3, 35 (2002).
-   48. Stuart, J. M., Segal, E., Koller, D. & Kim, S. K. A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249-55 (2003).
-   49. Schneider, T. D. & Stephens, R. M. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097-100 (1990).

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.

1. A collection of nucleic acid polymers, the collection comprising aplurality of polymers in which each polymer has a unique sequence,wherein (i) the polymers of the plurality collectively provide at least80% of all possible sites of length k, k being greater than 6, the sitesbeing composed of a set comprised of four canonical nucleotides; (ii)each nucleic acid polymer of the plurality has a length greater than 1.2k nucleotides, and (iii) the number of unique polymers in the pluralityis less than 0.5 times 80% of the number of all possible sites of lengthk.
 2. The collection of claim 1 wherein the polymers of the pluralitycollectively provide at least 98% of all possible sites of length k, andthe number of unique polymers in the plurality is less than 0.5 times98% of the number of all possible sites of length k.
 3. The collectionof claim 2 wherein the polymers of the plurality collectively provideall possible sites of length k, and the number of unique polymers in theplurality is less than 0.5 times the number of all possible sites oflength k.
 4. The collection of claim 1 wherein k is between 7 and
 10. 5.The collection of claim 1 wherein each of the represented sites oflength k is represented at least twice, each time in a differentcontext.
 6. The collection of claim 1 wherein each polymer of theplurality is physically associated with a substrate.
 7. The collectionof claim 6 wherein the substrate is a bead, and each polymer isphysically associated with a different bead.
 8. The collection of claim6 wherein the substrate is a planar array, and each polymer isphysically associated with a different address of the same array.
 9. Thecollection of claim 1 wherein each polymer of the plurality is insolution.
 10. The collection of claim 9 wherein the collection isdivided into a plurality of pools, each pool comprising a subset ofpolymers of the plurality.
 11. The collection of claim 1 wherein eachpolymer is doublestranded.
 12. The collection of claim 1 wherein eachpolymer is DNA.
 13. The collection of claim 1 wherein the compactionratio is less than 0.2.
 14. The collection of claim 1 wherein the set ofcanonical nucleotides consists of adenine, thymidine, guanine andcytosine.
 15. The collection of claim 1 wherein the set of canonicalnucleotides consists of adenine, uracil, guanine and cytosine.
 16. Thecollection of claim 1 wherein at least two of the sites are composed ofa set of nucleotides that comprises one non-naturally occurringnucleotide in addition to the four canonical nucleotides.
 17. A nucleic acid array comprising a plurality of addresses, each address of the plurality comprising a nucleic acid molecule wherein (i) the array comprises nucleic acid molecules, the molecules collectively providing at least 80% of all possible sites of length k, k being greater than 6, (ii) each nucleic acid molecule is associated with an address of the plurality and has a length greater than 1.5 k nucleotides, and (iii) the array has fewer addresses than 0.5 times 80% of the number of all possible sites of length k.
 18. The array of claim 17 wherein the array comprises a planar substrate.
 19. The array of claim 17 wherein the nucleic acid molecules are double-stranded DNAs.
 20. A method of evaluating interaction specificity of a test compound, the method comprising: contacting the test compound to a plurality of nucleic acid molecules, wherein (i) the polymers of the plurality collectively provide at least 80% of all possible sites of length k, k being greater than 6, the sites being composed of a set of four canonical nucleotides; (ii) each nucleic acid polymer of the plurality has a length greater than 1.2 k nucleotides, and (iii) the number of unique polymers in the plurality is less than 0.5 times 80% of the number of all possible sites of length k; and evaluating interactions between the test compound and one or a subset of nucleic acid molecules of the plurality.
 21. The method of claim 20 wherein the nucleic acid molecules are immobilized on an array and the test compound is in solution prior to the contacting, and the step of evaluating comprises detecting the test compound on the array.
 22. The method of claim 20 wherein the nucleic acid molecules are immobilized on an array and the test compound is in solution prior to the contacting, and the step of evaluating comprises detecting an alteration in one or more of the nucleic acid molecules on the array.
 23. The method of claim 22 wherein the test compound comprises a nucleic acid-modifying enzyme.
 24. The method of claim 20 wherein the nucleic acid molecules are in solution and the test compound is immobilized on a substrate, and the step of evaluating comprises washing the substrate and identifying nucleic acid molecules that are separated by the washing.
 25. The method of claim 24 further comprising amplifying nucleic acid molecules that are bound to the substrate.
 26. The method of claim 20 wherein the test compound comprises a protein or nucleic acid.
 27. The method of claim 20 wherein the test compound has enzymatic activity.
 28. The method of claim 19 wherein the test compound is a methylase, a nuclease, or a polymerase.
 29. The method of claim 20 wherein the test compound has a molecular weight less than 2000 Daltons.
 30. The method of claim 20 further comprising formulating the test compound as a pharmaceutical composition.
 31. The method of claim 30 further comprising administering the test compound to a subject.
 32. A method of designing a collection of nucleic acids, the method comprising: enumerating, in a string, all permutations of four identifiers in words of a preselected length k, wherein each permutation occurs a limited number of times in the string; segmenting the string into segments of less than a length of 100; and synthesizing oligonucleotides according to the sequence of the identifiers in each of the segments.
 33. The method of claim 32 wherein the step of enumerating comprises using a linear shift register.
 34. The method of claim 32 wherein each word of the preselected length occurs exactly once in the string.
 35. The method of claim 32 wherein k is between 6 and 12.
 36. The method of claim 32 wherein the string is segmented into segments of less than a length of 50.
 37. The method of claim 32 wherein the segments include overlapping ends.
 38. The method of claim 32 further comprising disposing each of the oligonucleotides on a planar substrate.
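
By way of illustration only, the enumerating and segmenting steps recited in claims 32, 34 and 37 might be sketched as follows. The sketch is not the claimed implementation: it builds the string with a standard Lyndon-word (de Bruijn sequence) construction rather than the linear shift register of claim 33, and it assumes that "overlapping ends" means adjacent segments share k-1 identifiers so that no word of length k is lost at a segment boundary. The function names `de_bruijn`, `design_segments` and `kmer_counts`, and the example values k = 8 and a segment length of 60, are hypothetical choices made for this sketch.

```python
from collections import Counter


def de_bruijn(alphabet: str, k: int) -> str:
    """Return a cyclic string in which every length-k word occurs once per cycle."""
    n = len(alphabet)
    a = [0] * (k + 1)
    out = []

    def extend(t: int, p: int) -> None:
        # Standard recursive generation of Lyndon words whose length divides k.
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
            return
        a[t] = a[t - p]
        extend(t + 1, p)
        for j in range(a[t - p] + 1, n):
            a[t] = j
            extend(t + 1, t)

    extend(1, 1)
    return "".join(alphabet[i] for i in out)


def design_segments(k: int, seg_len: int, alphabet: str = "ACGT") -> list[str]:
    """Cut a linearized de Bruijn string into segments overlapping by k-1."""
    if not k < seg_len < 100:
        raise ValueError("seg_len must be greater than k and less than 100")
    cyclic = de_bruijn(alphabet, k)
    # Append the first k-1 identifiers so words that wrap around are kept.
    linear = cyclic + cyclic[:k - 1]
    step = seg_len - (k - 1)  # assumed overlap of k-1 between neighbours
    return [linear[i:i + seg_len]
            for i in range(0, len(linear) - (k - 1), step)]


def kmer_counts(segments: list[str], k: int) -> Counter:
    """Count every word of length k found in any segment."""
    counts = Counter()
    for seg in segments:
        for i in range(len(seg) - k + 1):
            counts[seg[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    k, seg_len = 8, 60  # illustrative values only
    segments = design_segments(k, seg_len)
    counts = kmer_counts(segments, k)
    coverage = len(counts) / 4 ** k
    # Compaction ratio: unique molecules divided by represented k-mers.
    ratio = len(set(segments)) / len(counts)
    print(f"segments: {len(segments)}")
    print(f"coverage of all {k}-mers: {coverage:.3f}")
    print(f"compaction ratio: {ratio:.4f}")
```

Under these assumptions each word of length k falls in exactly one segment, so the segment count, coverage and compaction ratio printed by the sketch can be compared directly with the thresholds recited in claims 1 and 13. Each segment's identifier sequence would then serve as the sequence of one oligonucleotide to be synthesized.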